SEMIPARAMETRIC ESTIMATION OF MUTUAL INFORMATION AND
RELATED CRITERIA: OPTIMAL TEST OF INDEPENDENCE


AMOR KEZIOU AND PHILIPPE REGNAULT

Abstract. We derive independence tests by thresholding dependence measures in a semiparametric context. Precisely, estimates of $\varphi$-mutual informations, associated to $\varphi$-divergences between a joint distribution and the product distribution of its margins, are derived through the dual representation of $\varphi$-divergences. The asymptotic properties of the proposed estimates are established, including consistency, asymptotic distributions and a large deviations principle. The resulting tests of independence are compared via their relative asymptotic Bahadur efficiency and numerical simulations. It follows that the proposed semiparametric Kullback-Leibler mutual information test is the optimal one. In addition, the proposed approach provides a new method for estimating the Kullback-Leibler mutual information in a semiparametric setting, as well as a model selection procedure in a large class of dependency models including semiparametric copulas.

Keywords: Mutual informations, $\varphi$-divergences, Fenchel duality, Tests of independence, Semiparametric inference.
1. Introduction and notations


Measuring the dependence between random variables has been a central aim of probability theory since its earliest developments. Classical examples of dependence measures are the correlation measures of Pearson, Kendall and Spearman. While the first focuses on the linear relationship between real random variables, the latter two measure the monotonic relationship between variables taking values in ordered sets. Pure-independence measures, between variables $X$ and $Y$ taking values in general measurable spaces $(\mathcal{X},\mathcal{A}_{\mathcal{X}})$ and $(\mathcal{Y},\mathcal{A}_{\mathcal{Y}})$, can be defined by considering any divergence between the joint distribution $P$ of $(X,Y)$ and the product distribution of its margins $P^\perp := P_1\otimes P_2$, where $P_1$ and $P_2$ are, respectively, the marginal distributions of $X$ and $Y$. The most outstanding and widely used example of such dependence measures is the $\chi^2$-divergence between $P$ and $P^\perp$ defined by

$$\chi^2(P,P^\perp) := \frac{1}{2}\int_{\mathcal{X}\times\mathcal{Y}}\left(\frac{dP}{dP^\perp}(x,y)-1\right)^2 dP^\perp(x,y), \qquad (1)$$

where $dP/dP^\perp$ denotes the density of $P$ with respect to (w.r.t.) $P^\perp$. Note that, if $P$ is a discrete distribution, i.e., if its support $\mathcal{X}\times\mathcal{Y} := \mathrm{supp}(P)$ is a discrete (finite or countably infinite) set, then the above divergence writes

$$\chi^2(P,P^\perp) = \frac{1}{2}\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\frac{(p_{x,y}-p_x p_y)^2}{p_x p_y},$$

where $P := (p_{x,y})_{(x,y)}$, $P^\perp = (p_x p_y)_{(x,y)}$, with $p_x := \sum_y p_{x,y}$ and $p_y := \sum_x p_{x,y}$. Another classical example, associated to the Kullback-Leibler (KL) divergence between $P$ and $P^\perp$, is the well-known mutual information (MI) defined by (see e.g. Cover and Thomas (2006))

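$$I_{KL}(P) := \int_{\mathcal{X}\times\mathcal{Y}}\log\frac{dP}{dP^\perp}(x,y)\, dP(x,y), \qquad (2)$$

which, in the case of discrete distributions, can be written in the form

$$I_{KL}(P) = \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p_{x,y}\log\frac{p_{x,y}}{p_x p_y}.$$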


We will call the above classical measures of dependence (1) and (2), respectively, the $\chi^2$-mutual information ($\chi^2$-MI) and the KL-mutual information (KL-MI). When dealing with i.i.d. observations $(X_1,Y_1),\ldots,(X_n,Y_n)$ of two random variables $(X,Y)$, we may test the null hypothesis that the variables $X$ and $Y$ are independent by estimating such a dependence measure and rejecting the null hypothesis of independence if the estimate is sufficiently far from zero. The classical $\chi^2$-independence test is such a procedure: the corresponding test statistic (in the discrete-distribution case) is

$$2n\,\chi^2(\widehat{P},\widehat{P}^\perp) = n\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\frac{(\widehat{p}_{x,y}-\widehat{p}_x\widehat{p}_y)^2}{\widehat{p}_x\widehat{p}_y}, \qquad (3)$$

where $\widehat{P} := (\widehat{p}_{x,y})_{(x,y)}$ and $\widehat{P}^\perp := (\widehat{p}_x\widehat{p}_y)_{(x,y)}$ are, respectively, the empirical versions of $P = (p_{x,y})_{(x,y)}$ and $P^\perp = (p_x p_y)_{(x,y)}$. Likewise, to test independence, we can take the KL-MI as dependence measure and use the test statistic

$$2n\,I_{KL}(\widehat{P}) = 2n\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\widehat{p}_{x,y}\log\frac{\widehat{p}_{x,y}}{\widehat{p}_x\widehat{p}_y}. \qquad (4)$$
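For a finite contingency table, the two plug-in statistics (3) and (4) can be computed directly from the cell counts. The Python sketch below is only illustrative: the function name `plug_in_statistics` and the convention that empty cells contribute zero to the KL term are assumptions made here, not part of the paper.

```python
import math

def plug_in_statistics(counts):
    """Return (2n * chi2-MI, 2n * KL-MI) for a table of counts (list of rows).

    Assumption of this sketch: cells with zero count contribute 0 to the KL sum
    (convention 0 * log 0 = 0), and rows/columns with zero total are skipped.
    """
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    chi2_stat = 0.0
    kl_stat = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            p_xy = c / n                                # empirical joint probability
            p_prod = row_tot[i] * col_tot[j] / n ** 2   # product of empirical margins
            if p_prod == 0:
                continue
            chi2_stat += n * (p_xy - p_prod) ** 2 / p_prod
            if p_xy > 0:
                kl_stat += 2 * n * p_xy * math.log(p_xy / p_prod)
    return chi2_stat, kl_stat

chi2_stat, kl_stat = plug_in_statistics([[30, 10], [10, 30]])
```

On this 2 x 2 table the first value coincides with the usual Pearson statistic computed from observed and expected counts; both statistics vanish (up to rounding) when the table factorizes exactly into its margins.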

The dependence measure can also be any other $\varphi$-divergence between $P$ and $P^\perp$. The tests based on such dependence measures, including the $\chi^2$-MI and KL-MI ones, have been extensively studied in the case of finite-discrete distributions; see e.g. Pardo (2006), Chapter 8, and the references therein. When dealing with continuous distributions (or continuous random variables), the above direct plug-in estimates (3) and (4) of the dependence measures (1) and (2) are not well defined. Moreover, for countably infinite discrete distributions, although the estimates (3) and (4) remain well defined, their limiting distributions are not accessible. Therefore, in the case of non finite-discrete distributions, particularly for the widely used KL-MI, other kinds of estimates have been proposed and studied in the literature. Moon et al. (1995) use a kernel density estimate. Kraskov et al. (2004) propose a $k$-nearest-neighbor estimate, extending the estimates of Shannon entropy in one dimension based on $m$-spacings; see e.g. Tsybakov and van der Meulen (1996), Dudewicz and van der Meulen (1981) and Beirlant et al. (1997), among others. Van Hulle (2005) derives an estimate using an Edgeworth approximation of Shannon entropy. Darbellay and Vajda (1999), Wang et al. (2005) and Cellucci et al. (2005) propose estimates based on adaptive partitioning of $\mathcal{X}\times\mathcal{Y}$. See also Khan et al. (2007) for an overview and numerical comparisons of these estimates. Based on the Kullback-Leibler importance estimation procedure of Sugiyama et al. (2008), Suzuki et al. (2008) obtain an estimate of KL-MI called maximum likelihood mutual information; see also Sugiyama et al. (2012), Chapter 11. Unfortunately, the (asymptotic) distributions of these estimates remain inaccessible. Hence, testing independence from these estimates requires Monte Carlo or bootstrap approximations of the related p-values. Moreover, the above nonparametric estimates suffer from a loss of efficiency, due to smoothing or partitioning, and from the difficulty of conveniently choosing the classes, the number of classes or the smoothing parameters (the bandwidths and the kernels).

The present paper introduces new efficient semiparametric estimates of $\varphi$-mutual information ($\varphi$-MI), i.e., dependence measures associated to $\varphi$-divergence functionals, including the well-known KL-MI and $\chi^2$-MI. These estimates are obtained by making use of a dual representation of $\varphi$-MI, presented in Section 2, without any smoothing or partitioning. The estimates are defined in the same way for both finite-discrete and non-discrete distributions, and coincide with the direct plug-in ones in the case of finite-discrete distributions. Their asymptotic properties are presented in Section 3. In particular, consistency is stated for a large variety of semiparametric models for $dP/dP^\perp$; the asymptotic distribution is obtained for the KL-MI estimate in a special setting. The present approach leads to new independence tests, whose Bahadur efficiencies are compared in Section 4; the most efficient test is shown to be the one based on the proposed estimate of the KL-MI criterion. The approach can also be used to build a large variety of dependence models, for instance through a cross-validation-type model selection procedure based on the proposed estimate of the $\varphi$-MI measure of dependence; see Section 2.4. The powers of $\varphi$-MI based tests are compared numerically to classical noncorrelation tests in Section 5. Unlike the classical noncorrelation tests, the results of the present paper remain valid for the multisample problem (estimating $\varphi$-mutual informations of a multidimensional random variable, as well as testing simultaneous independence of its components); for simplicity, the results are presented only for the two-sample case. All proofs are postponed to the Appendix.

2. $\varphi$-mutual informations. Dual representations and Estimation strategy

Given an i.i.d. sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ of a random vector $(X,Y)$ taking values in a measurable space $(\mathcal{X}\times\mathcal{Y},\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}})$, we aim at testing the null hypothesis $\mathcal{H}_0$ of independence of the margins $X$ and $Y$; formally

$$\mathcal{H}_0:\ X \text{ and } Y \text{ are independent, against } \mathcal{H}_1:\ X \text{ and } Y \text{ are dependent.} \qquad (5)$$

We derive such tests by estimating and thresholding $\varphi$-mutual informations between $X$ and $Y$ in a semiparametric context. Sections 2.1, 2.2 and 2.3 below, respectively, define $\varphi$-mutual informations, present the semiparametric model under study, and introduce estimates of $\varphi$-MI used as test statistics for the test problem (5). Section 2.4 defines a cross-validation procedure for model selection among $L$ candidate models for the ratio $dP/dP^\perp$, using the proposed estimate of $\varphi$-MI.

2.1. Introducing $\varphi$-mutual informations. Denote by $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ the set of all probability distributions on the product measurable space $(\mathcal{X}\times\mathcal{Y},\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}})$. Let $\varphi:\mathbb{R}\to[0,+\infty]$ be some nonnegative closed proper convex function such that its domain $\mathrm{dom}\,\varphi := \{x\in\mathbb{R};\ \varphi(x)<\infty\} =: (a_\varphi,b_\varphi)$ is an interval, with endpoints $a_\varphi<1<b_\varphi$, and $\varphi(1)=0$. The interval $(a_\varphi,b_\varphi)$ may be bounded or unbounded, open or not. The $\varphi$-divergence between any probability distributions $Q,P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, if $Q$ is absolutely continuous with respect to (a.c.w.r.t.) $P$, is defined by

$$D_\varphi(Q,P) := \int_{\mathcal{X}\times\mathcal{Y}}\varphi\left(\frac{dQ}{dP}(x,y)\right) dP(x,y).$$

If $Q$ is not a.c.w.r.t. $P$, we set $D_\varphi(Q,P)=+\infty$. Note that $D_\varphi(Q,P)\ge 0$ for any $Q$ and $P$. Moreover, if $\varphi$ is strictly convex on some neighborhood of 1, we have the fundamental property

$$D_\varphi(Q,P)\ge 0, \text{ with equality if and only if } Q=P.$$

In the following, we assume that the function $\varphi$ is strictly convex and twice continuously differentiable on the interior of its domain $(a_\varphi,b_\varphi)$. We then have $\varphi'(1)=0$ and, without loss of generality, we can assume that $\varphi''(1)=1$. The well-known Kullback-Leibler divergence $K(\cdot,\cdot)$ is obtained for $\varphi(x)=\varphi_1(x) := x\log x-x+1$; the "modified" Kullback-Leibler divergence $K_m(\cdot,\cdot)$ is obtained for $\varphi(x)=\varphi_0(x) := -\log x+x-1$. The $\chi^2$ and modified $\chi^2$ divergences, denoted $\chi^2(\cdot,\cdot)$ and $\chi^2_m(\cdot,\cdot)$, are associated, respectively, to the convex functions $\varphi(x)=\varphi_2(x) := (x-1)^2/2$ and $\varphi(x)=\varphi_{-1}(x) := (x-1)^2/(2x)$. The so-called Hellinger distance $H(\cdot,\cdot)$ is obtained for $\varphi(x)=\varphi_{1/2}(x) := 2(\sqrt{x}-1)^2$; see Table 1. All these divergences are members of the so-called "power divergences" $D_\gamma(\cdot,\cdot)$, associated to the convex functions defined by

$$\varphi_\gamma(\cdot):\ x\in\mathbb{R}_+^*\mapsto\varphi_\gamma(x) := \frac{x^\gamma-\gamma x+\gamma-1}{\gamma(\gamma-1)}, \qquad (6)$$

if $\gamma\in\mathbb{R}\setminus\{0,1\}$, $\varphi_0(x) := -\log x+x-1$ and $\varphi_1(x) := x\log x-x+1$. The standard divergences $K(\cdot,\cdot)$, $K_m(\cdot,\cdot)$, $\chi^2(\cdot,\cdot)$, $\chi^2_m(\cdot,\cdot)$ and $H(\cdot,\cdot)$ are then associated, respectively, to the real convex functions $\varphi_1(\cdot)$, $\varphi_0(\cdot)$, $\varphi_2(\cdot)$, $\varphi_{-1}(\cdot)$ and $\varphi_{1/2}(\cdot)$. Note that the divergences are generally not symmetric; in particular, for any $Q,P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, we have $K_m(Q,P)=K(P,Q)$ and $\chi^2_m(Q,P)=\chi^2(P,Q)$. For more details and proofs, we refer to Liese and Vajda (1987) and Broniatowski and Keziou (2006).

For any probability distribution $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, let $P^\perp$ denote the product distribution $P^\perp := P_1\otimes P_2$ of the margins $P_1$ and $P_2$ of $P$. The $\varphi$-mutual information of $P$, associated to the divergence $D_\varphi(\cdot,\cdot)$, is defined as

$$I_\varphi(P) := D_\varphi(P,P^\perp).$$

For any random vector $(X,Y)$ defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$ and taking its values in $(\mathcal{X}\times\mathcal{Y},\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}})$, with joint distribution $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, the $\varphi$-mutual information ($\varphi$-MI) of $(X,Y)$ is defined to be

$$I_\varphi(X,Y) := I_\varphi(P) = \int_{\mathcal{X}\times\mathcal{Y}}\varphi\left(\frac{dP}{dP^\perp}(x,y)\right) dP^\perp(x,y). \qquad (7)$$
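To make the power-divergence family (6) and the definition of the $\varphi$-MI concrete, here is a small Python sketch for a finite joint probability table with strictly positive cells. The function names `phi_gamma` and `phi_mi` are labels chosen for this illustration and do not come from the paper.

```python
import math

def phi_gamma(x, gamma):
    """Power-divergence generator phi_gamma of (6), evaluated at x > 0."""
    if gamma == 0:
        return -math.log(x) + x - 1.0
    if gamma == 1:
        return x * math.log(x) - x + 1.0
    return (x ** gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

def phi_mi(joint, gamma):
    """I_phi(P) = sum phi(p_xy / (p_x p_y)) * p_x p_y for a joint pmf given as rows.

    Assumption of this sketch: every cell of `joint` is strictly positive.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        phi_gamma(p / (p_x[i] * p_y[j]), gamma) * p_x[i] * p_y[j]
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
    )

joint = [[0.375, 0.125], [0.125, 0.375]]
kl_mi = phi_mi(joint, 1)      # KL-MI, gamma = 1
chi2_mi = phi_mi(joint, 2)    # chi^2-MI, gamma = 2
hellinger_mi = phi_mi(joint, 0.5)
```

Every member of the family returns zero on a table that factorizes into its margins, which is the property used to turn these quantities into test statistics.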

Since $D_\varphi(P,P^\perp)\ge 0$, with equality if and only if $P=P^\perp$, i.e., if and only if $X$ and $Y$ are independent, the $\varphi$-MI measures the dependence between the random variables $X$ and $Y$. In contrast to the correlation coefficients of Pearson, Kendall or Spearman, the $\varphi$-MI does not focus on the linear or monotonic relationship between random variables; it constitutes a proper dependency measure. Note that $I_{\varphi_1}$ and $I_{\varphi_2}$, with $\varphi_1$ and $\varphi_2$ given in Table 1, are, respectively, the KL-MI and $\chi^2$-MI, given by (2) and (1). Thus, the test problem (5) is equivalent, in the context of $\varphi$-MI criteria, to testing

$$I_\varphi(P)=0 \text{ against } I_\varphi(P)>0.$$

Hence, we can use as test statistic an estimate of $I_\varphi(P)$, and reject the null hypothesis $\mathcal{H}_0$ when the estimate takes large values. A natural attempt to estimate the $\varphi$-MI of $(X,Y)$ consists in considering the plug-in estimate of $I_\varphi(P)$ obtained by replacing $P(\cdot)$ by its empirical counterpart

$$\widehat{P}(\cdot) = \frac{1}{n}\sum_{i=1}^n\delta_{(X_i,Y_i)}(\cdot), \qquad (8)$$

associated to the i.i.d. sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ of $(X,Y)$. Here, $\delta_{(x,y)}(\cdot)$ denotes the Dirac measure at $(x,y)$ for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Unfortunately, by doing so, we only measure the dependence of the contingency table associated to the sample. When dealing with variables $X$ and $Y$ absolutely continuous with respect to the Lebesgue measure, the contingency table is almost surely an $n\times n$ table with all coefficients except the diagonal ones equal to zero; in particular, the variables $X$ and $Y$ appear (misleadingly) purely dependent, leading to systematic rejection of the null hypothesis. A second, less crude, approach consists in gathering the values $X_i$ and $Y_i$ into classes and testing independence between the induced finite-discrete variables by empirically estimating their $\varphi$-MI. This widespread approach suffers from the difficulty of conveniently choosing the classes. Moreover, a substantial amount of the information carried by the sample is lost during this process, leading to poor efficiency (or power) of these tests. Another approach is to use kernel nonparametric estimates of the joint density and the marginal ones but, as is well known, this provides less efficient estimates and leads to the difficulty of choosing the optimal smoothing parameters. As an alternative, we propose in the present paper semiparametric modeling of the ratio $dP/dP^\perp$, and the use of duality to obtain well-defined estimates of $\varphi$-MI without smoothing or partitioning. The present approach applies to continuous distributions, discrete distributions, and mixtures of continuous and discrete distributions.


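2.2. Semiparametric modeling of the ratio $dP/dP^\perp$. Assume that the joint distribution $P$ of the random vector $(X,Y)$ belongs to the semiparametric model

$$\mathcal{M}_\Theta := \left\{P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \text{ such that } \frac{dP}{dP^\perp}(\cdot,\cdot) =: h_\theta(\cdot,\cdot);\ \theta\in\Theta\right\}, \qquad (9)$$

where $\Theta\subset\mathbb{R}^{1+d}$ is the parameter space, and $h_\theta(\cdot,\cdot): (x,y)\in\mathcal{X}\times\mathcal{Y}\mapsto h_\theta(x,y)\in\mathbb{R}$ is some specified real-valued function, indexed by the parameter $\theta$. In the sequel, we will consider the following assumptions on the model $\mathcal{M}_\Theta$.

(A.1) $\left(h_\theta(x,y)=h_{\theta'}(x,y),\ \forall (x,y)\in\mathcal{X}\times\mathcal{Y}\right)\Rightarrow(\theta=\theta')$ (identifiability);

(A.2) there exists (a unique) $\theta_0\in\mathrm{int}(\Theta)$ satisfying $h_{\theta_0}(x,y)=1$, $\forall (x,y)\in\mathcal{X}\times\mathcal{Y}$.

Assumption (A.1) is a natural identifiability condition for $dP/dP^\perp$. Assumption (A.2) ensures that independence is covered by the model $\mathcal{M}_\Theta$. The uniqueness of $\theta_0$ follows from Assumption (A.1). Denote by $\theta_T$ the "true" unknown value of the parameter, namely, the unique value satisfying

$$\frac{dP}{dP^\perp}(x,y)=h_{\theta_T}(x,y),\quad\forall (x,y)\in\mathcal{X}\times\mathcal{Y},$$

which is assumed to be an interior point of $\Theta$. Then, we have $\theta_T=\theta_0$ if and only if $X$ and $Y$ are independent. Below are listed some relevant examples of the model (9).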


Example 2.1. Let $(X,Y)\in\mathbb{R}^2$ be a centered Gaussian random vector with correlation coefficient $\rho\in\,]-1,1[$ and centered normal margins with the same variance $\sigma^2>0$. A straightforward computation shows that the ratio $dP/dP^\perp$ can be written in the form of the model (9) where

$$h_\theta(x,y)=\exp\left\{\alpha+\beta_1(x^2+y^2)+\beta_2xy\right\}, \qquad (10)$$

$\theta := (\alpha,\beta_1,\beta_2)^\top\in\mathbb{R}^3$, with $\alpha=-\log(1-\rho^2)/2$, $\beta_1=-\rho^2/\left(2\sigma^2(1-\rho^2)\right)$ and $\beta_2=\rho/\left(\sigma^2(1-\rho^2)\right)$. Note that the parameter value corresponding to the independence hypothesis is $\theta_0=(0,0,0)^\top$. Moreover, if the distribution of $(X,Y)$ is Gaussian with unknown mean $\mu := (\mu_1,\mu_2)^\top$ and unknown covariance matrix $\Sigma$, then we can show that the ratio $dP/dP^\perp$ can be written in the form of the model (9) with

$$h_\theta(x,y)=\exp\left\{\alpha+\beta_1x+\beta_2y+\beta_3x^2+\beta_4y^2+\beta_5xy\right\}, \qquad (11)$$

and $\theta := (\alpha,\beta_1,\beta_2,\beta_3,\beta_4,\beta_5)^\top$. Note that the number of free parameters in $\theta_T$ is $d=5$, and that $\alpha_T$ is considered as a normalizing parameter due to the constraint $\int_{\mathcal{X}\times\mathcal{Y}}h_{\theta_T}(x,y)\,dP^\perp(x,y)=\int_{\mathcal{X}\times\mathcal{Y}}dP(x,y)=1$, since $P$ is a probability distribution. Moreover, we have $\theta_0=(0,\ldots,0)^\top\in\mathbb{R}^6$.
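The closed form (10) can be checked numerically against the ratio of the bivariate normal density to the product of its margins. The sketch below uses only the Python standard library; the function names `gaussian_ratio` and `h_theta` are illustrative choices made here.

```python
import math

def gaussian_ratio(x, y, rho, sigma2):
    """dP/dP_perp at (x, y), computed from the joint and marginal normal densities."""
    det = 1.0 - rho ** 2
    joint = math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * sigma2 * det)) / (
        2 * math.pi * sigma2 * math.sqrt(det)
    )
    def margin(t):
        return math.exp(-t * t / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return joint / (margin(x) * margin(y))

def h_theta(x, y, rho, sigma2):
    """Model (10) evaluated at the parameter values stated in Example 2.1."""
    alpha = -0.5 * math.log(1.0 - rho ** 2)
    beta1 = -rho ** 2 / (2 * sigma2 * (1.0 - rho ** 2))
    beta2 = rho / (sigma2 * (1.0 - rho ** 2))
    return math.exp(alpha + beta1 * (x * x + y * y) + beta2 * x * y)

points = [(0.3, -1.2), (1.0, 1.0), (-2.0, 0.5)]
agree = all(
    math.isclose(gaussian_ratio(x, y, 0.5, 2.0), h_theta(x, y, 0.5, 2.0), rel_tol=1e-9)
    for x, y in points
)
```

Setting `rho = 0` gives `h_theta` identically equal to 1, matching the independence value $\theta_0=(0,0,0)^\top$.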

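Example 2.2. Let $p_0(\cdot,\cdot) := \mathbb{1}_{\mathcal{X}\times\mathcal{Y}}(\cdot,\cdot),\ p_1(\cdot,\cdot),\ p_2(\cdot,\cdot),\ldots$ be some basis functions of the space $L^2(\mathcal{X}\times\mathcal{Y},P^\perp)$, and assume that $\log(dP/dP^\perp(\cdot,\cdot))\in L^2(\mathcal{X}\times\mathcal{Y},P^\perp)$. We can then build increasing models of the form (9) by developing the function

$$(x,y)\in\mathcal{X}\times\mathcal{Y}\mapsto\log\frac{dP}{dP^\perp}(x,y)$$

according to the above basis functions. Using for instance the first $(1+d)$ basis functions, we obtain the following model for $dP/dP^\perp(\cdot,\cdot)$

$$h_\theta:\ (x,y)\in\mathcal{X}\times\mathcal{Y}\mapsto h_\theta(x,y)=\exp\left(\alpha+\beta_1p_1(x,y)+\cdots+\beta_dp_d(x,y)\right),$$

where $\theta=(\alpha,\beta_1,\ldots,\beta_d)^\top\in\Theta\subset\mathbb{R}^{1+d}$. Then, the independence parameter value is $\theta_0=(0,\ldots,0)^\top$.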


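Example 2.3. Assume that the support of $P$, $\mathrm{supp}(P) =: \mathcal{X}\times\mathcal{Y}$, is a known finite-discrete set of size $K_1K_2$; denote by $P := (p_{x,y})_{(x,y)\in\mathcal{X}\times\mathcal{Y}}$ the density of $P$ with respect to the counting measure on $\mathcal{X}\times\mathcal{Y}$. Then we have

$$\frac{dP}{dP^\perp}(x,y)=\exp\left(\sum_{(a,b)\in\mathcal{X}\times\mathcal{Y}}\theta_{a,b}\,\mathbb{1}_{\{a\}}(x)\,\mathbb{1}_{\{b\}}(y)\right), \qquad (12)$$

where

$$\theta_{a,b}=\log\frac{p_{a,b}}{p_ap_b},\quad (a,b)\in\mathcal{X}\times\mathcal{Y}.$$

If we denote for instance the elements of $\mathcal{X}$ and $\mathcal{Y}$ as follows

$$\mathcal{X} := \{a_1,\ldots,a_{K_1}\} \text{ and } \mathcal{Y} := \{b_1,\ldots,b_{K_2}\},$$

then we can see that $P$ belongs to the model (9) taking

$$h_\theta(x,y)=\exp\left(\alpha+\sum_{(i,j)\neq(1,1)}\beta_{i,j}\,\mathbb{1}_{\{a_i\}}(x)\,\mathbb{1}_{\{b_j\}}(y)\right), \qquad (13)$$

with the parametrization $\theta=(\alpha,\beta^\top)^\top\in\mathbb{R}^{1+(K_1K_2-1)}$, where $\alpha$ is a scalar and $\beta=(\beta_{i,j})_{(i,j)\neq(1,1)}$ is the $(K_1K_2-1)$-dimensional vector obtained from the $K_1\times K_2$ matrix of real entries $(\beta_{i,j})$ by removing the first entry $\beta_{1,1}$. Moreover, we have for the true value $\theta_T$

$$\alpha_T=\log\frac{p_{a_1,b_1}}{p_{a_1}p_{b_1}} \quad\text{and}\quad \beta_{i,j,T}=\log\frac{p_{a_i,b_j}}{p_{a_i}p_{b_j}}-\log\frac{p_{a_1,b_1}}{p_{a_1}p_{b_1}},$$

for all $(i,j)\in\{1,\ldots,K_1\}\times\{1,\ldots,K_2\}\setminus\{(1,1)\}$, and the number of free parameters in $\theta_T$ is equal to $(K_1-1)(K_2-1)$. Moreover, we have $\theta_0=(0,\ldots,0)^\top\in\mathbb{R}^{K_1K_2}$.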


Example 2.4. Assume that the distribution $P$ of the random vector $(X,Y)\in\mathbb{R}^2$ has continuous margins. The copula $C(\cdot,\cdot)$ of the vector $(X,Y)$, see e.g. Nelsen (2006), is defined, $\forall (u,v)\in\,]0,1[^2$, by

$$C(u,v) := F\left(F_1^{-1}(u),F_2^{-1}(v)\right),$$

where $F(\cdot,\cdot)$ is the cumulative distribution function of the vector $(X,Y)$, and $F_1$ and $F_2$ are the (marginal) cumulative distribution functions of $X$ and $Y$, respectively. The copula $C(\cdot,\cdot)$ is in itself a distribution function on $]0,1[^2$. If $F(\cdot,\cdot)$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^2$, then we have the relation

$$\frac{dP}{dP^\perp}(x,y)=\frac{f(x,y)}{f_1(x)f_2(y)}=c\left(F_1(x),F_2(y)\right),$$

where $f(\cdot,\cdot)$ is the joint density of $(X,Y)$, $f_1$ and $f_2$ are the marginal densities of $X$ and $Y$, and $c(\cdot,\cdot)$ is the copula density. Numerous parametric examples of the model (9) can then be obtained by taking the function

$$h_\theta(x,y)=c_\beta\left(F_{1,\gamma_1}(x),F_{2,\gamma_2}(y)\right), \qquad (14)$$

where $\{c_\beta(\cdot,\cdot);\ \beta\in D\subset\mathbb{R}^d\}$ is some parametric copula density model, see e.g. Nelsen (2006) or Joe (1997) for examples of such models, and $\{F_{1,\gamma_1};\ \gamma_1\in\Gamma_1\}$ and $\{F_{2,\gamma_2};\ \gamma_2\in\Gamma_2\}$ are some parametric models for the marginal distribution functions. Here, the parameter of interest is $\theta := (\gamma_1,\gamma_2,\beta)\in\Theta := \Gamma_1\times\Gamma_2\times D$. Note that assumption (A.2) is generally not satisfied for this particular model. In fact, if we denote by $\beta_0$ the particular value corresponding to the copula of independence, then we have $h_{(\gamma_1,\gamma_2,\beta_0)}(\cdot,\cdot)=1$, $\forall (\gamma_1,\gamma_2)\in\Gamma_1\times\Gamma_2$. Although assumption (A.2) is generally not satisfied, models (14) can be used in estimating $\varphi$-MI under the assumption that the margins are dependent.


Example 2.5. We can also deal with semiparametric models induced by semiparametric models of copula densities, with nonparametric unknown continuous marginal distribution functions $F_1(\cdot)$ and $F_2(\cdot)$, taking

$$h_\theta(x,y)=c_\theta\left(F_1(x),F_2(y)\right);\quad\theta\in\Theta\subset\mathbb{R}^d.$$

2.3. Dual representation and dual estimation of $\varphi$-MI. We define estimates of $\varphi$-MI by taking advantage of the modeling (9) and the dual representation of $\varphi$-divergences obtained in Keziou (2003) and Broniatowski and Keziou (2006). Denote by $\varphi^*(\cdot)$ the convex conjugate of the convex function $\varphi(\cdot)$, namely, the function defined by

$$\varphi^*:\ t\in\mathbb{R}\mapsto\varphi^*(t) := \sup_{x\in\mathbb{R}}\{tx-\varphi(x)\}\in\mathbb{R}\cup\{+\infty\}.$$

Note that $\varphi^*(\cdot)$ is, in turn, a proper closed convex function; in particular, $\varphi^*(0)=0$. Assume that $\varphi(\cdot)$ is essentially smooth, i.e., differentiable on $]a_\varphi,b_\varphi[$ with $\lim_{x\downarrow a_\varphi}\varphi'(x)=-\infty$ if $a_\varphi>-\infty$ and $\lim_{x\uparrow b_\varphi}\varphi'(x)=+\infty$ if $b_\varphi<+\infty$. This is equivalent to the condition that $\varphi^*(\cdot)$ is strictly convex on its domain. Provided that

(A.3) the $\varphi$-mutual information $I_\varphi(P)<\infty$,

see its definition (7), it can be rewritten in the form

$$I_\varphi(P)=\sup_{f\in\mathcal{F}}\left\{\int_{\mathcal{X}\times\mathcal{Y}}f(x,y)\,dP(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}\varphi^*\left(f(x,y)\right)dP^\perp(x,y)\right\}, \qquad (15)$$

where $\mathcal{F}$ is any class of measurable real-valued functions $f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ that contains the particular function $\varphi'(dP/dP^\perp)$ and satisfies the condition $\int|f|\,dP<\infty$ for all $f\in\mathcal{F}$. Note that, for all $x\in(a_\varphi,b_\varphi)$, we have

$$\varphi^*\left(\varphi'(x)\right)=x\varphi'(x)-\varphi(x).$$
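The identity $\varphi^*(\varphi'(x))=x\varphi'(x)-\varphi(x)$ gives a quick numerical sanity check of the closed-form conjugates listed in Table 1. The sketch below hard-codes the five standard generators, their derivatives and their conjugates; the dictionary keys are labels chosen for this illustration.

```python
import math

# (phi, phi', phi*) for the five standard divergences of Table 1.
DIVERGENCES = {
    "KL_m": (lambda x: -math.log(x) + x - 1,
             lambda x: 1 - 1 / x,
             lambda t: -math.log(1 - t)),
    "KL": (lambda x: x * math.log(x) - x + 1,
           lambda x: math.log(x),
           lambda t: math.exp(t) - 1),
    "chi2_m": (lambda x: (x - 1) ** 2 / (2 * x),
               lambda x: (1 - 1 / x ** 2) / 2,
               lambda t: 1 - math.sqrt(1 - 2 * t)),
    "chi2": (lambda x: (x - 1) ** 2 / 2,
             lambda x: x - 1,
             lambda t: t * t / 2 + t),
    "Hellinger": (lambda x: 2 * (math.sqrt(x) - 1) ** 2,
                  lambda x: 2 - 2 / math.sqrt(x),
                  lambda t: 2 * t / (2 - t)),
}

def conjugate_gap(name, x):
    """|phi*(phi'(x)) - (x phi'(x) - phi(x))|, which should vanish for x > 0."""
    phi, dphi, phi_star = DIVERGENCES[name]
    return abs(phi_star(dphi(x)) - (x * dphi(x) - phi(x)))

max_gap = max(
    conjugate_gap(name, x)
    for name in DIVERGENCES
    for x in (0.2, 0.5, 1.0, 2.0, 5.0)
)
```

A value of `max_gap` at rounding-error level indicates that each conjugate agrees with the identity on the test points; it is a spot check, not a proof.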

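Table 1 gives explicit formulas for the convex conjugates of some standard divergences.

| $D_\varphi$ | $\varphi$ | $\mathrm{dom}\,\varphi$ | $\mathrm{dom}\,\varphi^*$ | $\varphi^*(t)$ |
|---|---|---|---|---|
| $K_m(\cdot,\cdot)$ | $\varphi_0(x) := -\log x+x-1$ | $]0,+\infty[$ | $]-\infty,1[$ | $-\log(1-t)$ |
| $K(\cdot,\cdot)$ | $\varphi_1(x) := x\log x-x+1$ | $[0,+\infty[$ | $\mathbb{R}$ | $e^t-1$ |
| $\chi^2_m(\cdot,\cdot)$ | $\varphi_{-1}(x) := \frac{1}{2}\frac{(x-1)^2}{x}$ | $]0,+\infty[$ | $]-\infty,\frac{1}{2}]$ | $1-\sqrt{1-2t}$ |
| $\chi^2(\cdot,\cdot)$ | $\varphi_2(x) := \frac{1}{2}(x-1)^2$ | $\mathbb{R}$ | $\mathbb{R}$ | $\frac{1}{2}t^2+t$ |
| $H(\cdot,\cdot)$ | $\varphi_{1/2}(x) := 2(\sqrt{x}-1)^2$ | $[0,+\infty[$ | $]-\infty,2[$ | $\frac{2t}{2-t}$ |

Table 1. Convex conjugates for some standard divergences.

From (15), taking into account the model (9) by specifying

$$\mathcal{F}=\left\{\varphi'(h_\theta);\ \theta\in\Theta\right\},$$

and assuming in addition that

(A.4) for all $\theta\in\Theta$, we have $\int_{\mathcal{X}\times\mathcal{Y}}\left|\varphi'\left(h_\theta(x,y)\right)\right|dP(x,y)<\infty$,

we obtain

$$I_\varphi(P)=\sup_{\theta\in\Theta}\left\{\int_{\mathcal{X}\times\mathcal{Y}}\varphi'\left(h_\theta(x,y)\right)dP(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}\varphi^*\left(\varphi'(h_\theta(x,y))\right)dP^\perp(x,y)\right\}. \qquad (16)$$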

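Moreover, the supremum is unique and achieved at $\theta=\theta_T$. The uniqueness of the supremum $\theta_T$ follows from the strict convexity of $\varphi^*(\cdot)$ and the identifiability assumption (A.1). We then propose the following "dual" estimate of $I_\varphi(P)$

$$\widehat{I}_\varphi := \sup_{\theta\in\Theta}\left\{\int_{\mathcal{X}\times\mathcal{Y}}\varphi'\left(h_\theta(x,y)\right)d\widehat{P}(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}\varphi^*\left(\varphi'(h_\theta(x,y))\right)d\widehat{P}_1\otimes\widehat{P}_2(x,y)\right\}$$

$$\phantom{\widehat{I}_\varphi :}=\sup_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n\varphi'\left(h_\theta(X_i,Y_i)\right)-\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\varphi^*\left(\varphi'(h_\theta(X_i,Y_j))\right)\right\}, \qquad (17)$$

and the following "dual" estimate of the parameter $\theta_T$

$$\widehat{\theta}_\varphi := \arg\sup_{\theta\in\Theta}\left\{\int_{\mathcal{X}\times\mathcal{Y}}\varphi'\left(h_\theta(x,y)\right)d\widehat{P}(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}\varphi^*\left(\varphi'(h_\theta(x,y))\right)d\widehat{P}_1\otimes\widehat{P}_2(x,y)\right\}$$

$$\phantom{\widehat{\theta}_\varphi :}=\arg\sup_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n\varphi'\left(h_\theta(X_i,Y_i)\right)-\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\varphi^*\left(\varphi'(h_\theta(X_i,Y_j))\right)\right\}, \qquad (18)$$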


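where $\widehat{P}(\cdot)$ is the empirical distribution associated to the sample, given by (8). For ease of presentation, define, $\forall\theta\in\Theta$ and $\forall (x,y)\in\mathcal{X}\times\mathcal{Y}$, the functions

$$f_\theta(x,y) := \varphi'\left(h_\theta(x,y)\right), \qquad (19)$$

$$g_\theta(x,y) := \varphi^*\left(\varphi'(h_\theta(x,y))\right)=h_\theta(x,y)\,\varphi'\left(h_\theta(x,y)\right)-\varphi\left(h_\theta(x,y)\right), \qquad (20)$$

which we assume to be continuous in $\theta$ on the set $\Theta$, as well as the function

$$M:\ \theta\in\Theta\mapsto M(\theta) := \int_{\mathcal{X}\times\mathcal{Y}}f_\theta(x,y)\,dP(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}g_\theta(x,y)\,dP_1\otimes P_2(x,y) \qquad (21)$$

and its empirical version

$$M_n:\ \theta\in\Theta\mapsto M_n(\theta) := \int_{\mathcal{X}\times\mathcal{Y}}f_\theta(x,y)\,d\widehat{P}(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}g_\theta(x,y)\,d\widehat{P}_1\otimes\widehat{P}_2(x,y). \qquad (22)$$

Therefore, the formula (16) becomes

$$I_\varphi(P)=\sup_{\theta\in\Theta}M(\theta)=M(\theta_T),\quad\text{and}\quad\theta_T=\arg\sup_{\theta\in\Theta}M(\theta). \qquad (23)$$

The estimates (17) and (18), in turn, can be written as

$$\widehat{I}_\varphi=\sup_{\theta\in\Theta}M_n(\theta)=M_n(\widehat{\theta}_\varphi) \qquad (24)$$

and

$$\widehat{\theta}_\varphi=\arg\sup_{\theta\in\Theta}M_n(\theta). \qquad (25)$$

Note that the functions $f_\theta(\cdot,\cdot)$, $g_\theta(\cdot,\cdot)$, $M(\cdot)$ and $M_n(\cdot)$ all depend on $\varphi(\cdot)$, but the subscript $\varphi$ is omitted for simplicity.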
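As an illustration of (24) and (25), the sketch below computes the dual KL estimate under the centered Gaussian model (10), for which $f_\theta=\log h_\theta$ and $g_\theta=h_\theta-1$. Several choices here are assumptions of this sketch rather than the paper's procedure: the normalizing parameter $\alpha$ is profiled out in closed form ($\widehat{\alpha}(\beta)=-\log\frac{1}{n^2}\sum_{i,j}\exp(\beta^\top m(X_i,Y_j))$ with $m(x,y)=(x^2+y^2,\,xy)$), the remaining concave criterion is maximized by a damped Newton iteration, and the names `weighted_moments` and `dual_kl_mi` are hypothetical.

```python
import math
import random

def weighted_moments(beta, xs, ys):
    """Moments of m(x, y) = (x^2 + y^2, x*y) over all n^2 pairs (X_i, Y_j),
    weighted by exp(beta . m). Returns (log A(beta), mean, covariance entries)."""
    s0 = s1 = s2 = s11 = s12 = s22 = 0.0
    for x in xs:
        for y in ys:
            m1, m2 = x * x + y * y, x * y
            w = math.exp(beta[0] * m1 + beta[1] * m2)
            s0 += w
            s1 += w * m1
            s2 += w * m2
            s11 += w * m1 * m1
            s12 += w * m1 * m2
            s22 += w * m2 * m2
    mean = (s1 / s0, s2 / s0)
    cov = (s11 / s0 - mean[0] ** 2, s12 / s0 - mean[0] * mean[1], s22 / s0 - mean[1] ** 2)
    return math.log(s0 / (len(xs) * len(ys))), mean, cov

def dual_kl_mi(xs, ys, max_iter=50, tol=1e-8):
    """Maximize the profiled criterion beta . m_bar - log A(beta); returns
    (estimate of KL-MI, (alpha_hat, beta1_hat, beta2_hat))."""
    n = len(xs)
    m_bar = (sum(x * x + y * y for x, y in zip(xs, ys)) / n,
             sum(x * y for x, y in zip(xs, ys)) / n)
    beta, value = (0.0, 0.0), 0.0          # M_n equals 0 at theta = 0
    log_a, mean, cov = weighted_moments(beta, xs, ys)
    for _ in range(max_iter):
        g1, g2 = m_bar[0] - mean[0], m_bar[1] - mean[1]       # gradient
        det = cov[0] * cov[2] - cov[1] ** 2
        step = ((cov[2] * g1 - cov[1] * g2) / det,            # Newton direction
                (cov[0] * g2 - cov[1] * g1) / det)
        norm = math.hypot(step[0], step[1])
        if norm < tol:
            break
        scale = min(1.0, 1.0 / norm)                          # cap the step length
        while True:                                           # backtracking
            cand = (beta[0] + scale * step[0], beta[1] + scale * step[1])
            c_log_a, c_mean, c_cov = weighted_moments(cand, xs, ys)
            c_value = cand[0] * m_bar[0] + cand[1] * m_bar[1] - c_log_a
            if c_value >= value - 1e-12 or scale < 1e-6:
                break
            scale /= 2
        beta, value, log_a, mean, cov = cand, c_value, c_log_a, c_mean, c_cov
    return value, (-log_a, beta[0], beta[1])

rng = random.Random(0)
rho, n = 0.6, 300
xs, ys = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(z1)
    ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
estimate, theta_hat = dual_kl_mi(xs, ys)
true_mi = -0.5 * math.log(1 - rho ** 2)   # KL-MI of the simulated Gaussian pair
```

The double loop over all $n^2$ pairs mirrors the second sum in (17); it costs $O(n^2)$ per criterion evaluation, so a vectorized implementation would be preferable for larger samples.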
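Example 2.6. In the context of finite-discrete distributions, using the exponential model described in Example 2.3, we show that the proposed dual estimate (17) of $I_\varphi(P)$, obtained by the above "duality" technique, equals the direct plug-in one

$$\widehat{I}_\varphi^{\,emp} := I_\varphi(\widehat{P})=\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\varphi\left(\frac{\widehat{p}_{x,y}}{\widehat{p}_x\widehat{p}_y}\right)\widehat{p}_x\widehat{p}_y. \qquad (26)$$

Indeed, we have by definition

$$\widehat{I}_\varphi=\sup_{\theta\in\Theta}M_n(\theta),\ \text{where}\ M_n(\theta)=\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\left[\varphi'\left(e^{\theta_{x,y}}\right)\widehat{p}_{x,y}-e^{\theta_{x,y}}\varphi'\left(e^{\theta_{x,y}}\right)\widehat{p}_x\widehat{p}_y+\varphi\left(e^{\theta_{x,y}}\right)\widehat{p}_x\widehat{p}_y\right]. \qquad (27)$$

Differentiating (27) with respect to $\theta_{x,y}$, for $(x,y)\in\mathcal{X}\times\mathcal{Y}$, yields

$$\frac{\partial}{\partial\theta_{x,y}}M_n(\theta)=\left(e^{-\theta_{x,y}}\widehat{p}_{x,y}-\widehat{p}_x\widehat{p}_y\right)e^{2\theta_{x,y}}\varphi''\left(e^{\theta_{x,y}}\right).$$

Canceling the derivatives $\frac{\partial}{\partial\theta_{x,y}}M_n(\theta)$ yields

$$\widehat{\theta}_{x,y}=\log\frac{\widehat{p}_{x,y}}{\widehat{p}_x\widehat{p}_y},\quad (x,y)\in\mathcal{X}\times\mathcal{Y},$$

which does not depend on the choice of $\varphi$ for this particular model. Finally, straightforward simplifications yield

$$\widehat{I}_\varphi=M_n(\widehat{\theta})=\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}\varphi\left(\frac{\widehat{p}_{x,y}}{\widehat{p}_x\widehat{p}_y}\right)\widehat{p}_x\widehat{p}_y=\widehat{I}_\varphi^{\,emp}.$$

In particular, for $\varphi(x)=\varphi_2(x) := (x-1)^2/2$, the estimate $\widehat{I}_{\chi^2}$ of the $\chi^2$-mutual information (or measure of independence) obtained by the duality technique is shown to equal, up to the factor $2n$, the classical $\chi^2$ statistic. Hence, in the context of finite-discrete distributions, using the exponential model described in Example 2.3, the proposed approach via the duality technique recovers the classical direct plug-in one and, in particular, the well-known classical $\chi^2$-independence test.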
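The equality between the dual estimate and the plug-in estimate in the finite-discrete case can be checked numerically. The sketch below evaluates the criterion (27) at $\widehat{\theta}_{x,y}=\log(\widehat{p}_{x,y}/(\widehat{p}_x\widehat{p}_y))$ for the KL generator and compares it with (26); the function names are illustrative, and the table is assumed to have strictly positive cells.

```python
import math

def margins(joint):
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

def dual_criterion(theta, joint, phi, dphi):
    """M_n(theta) of (27) for the saturated model h_theta(x, y) = exp(theta[x][y])."""
    p_x, p_y = margins(joint)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            h = math.exp(theta[i][j])
            total += dphi(h) * p - (h * dphi(h) - phi(h)) * p_x[i] * p_y[j]
    return total

def plug_in(joint, phi):
    """Direct plug-in estimate (26)."""
    p_x, p_y = margins(joint)
    return sum(phi(p / (p_x[i] * p_y[j])) * p_x[i] * p_y[j]
               for i, row in enumerate(joint) for j, p in enumerate(row))

def theta_hat_of(joint):
    """Maximizer of (27): log of the ratio joint / product of margins."""
    p_x, p_y = margins(joint)
    return [[math.log(p / (p_x[i] * p_y[j])) for j, p in enumerate(row)]
            for i, row in enumerate(joint)]

def kl_phi(x):
    return x * math.log(x) - x + 1

joint = [[0.375, 0.125], [0.125, 0.375]]   # empirical frequencies, all positive
theta_hat = theta_hat_of(joint)
dual_value = dual_criterion(theta_hat, joint, kl_phi, math.log)
plug_in_value = plug_in(joint, kl_phi)
```

Moving any coordinate of `theta_hat` away from its value lowers `dual_criterion`, in line with the first-order condition derived in Example 2.6.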


Remark 2.7. For finite discrete distributions (with known support, of size say $K$, see Example 2.3), as in plug-in estimation of Shannon entropy (see e.g. Chao and Shen (2003)), the direct plug-in estimates $\widehat{I}_\varphi^{\,emp}$ are valid with small bias if the sample size $n\gg K$. If the sample size $n$ is not sufficiently large compared to the space size $K$, models $h_\theta(\cdot)$ other than (12) should be used (through e.g. the model selection procedure described in Section 2.4), with small parameter dimension, and the corresponding dual estimate $\widehat{I}_\varphi$, if the model $h_\theta(\cdot)$ is correctly specified, could be more promising than the direct plug-in one $\widehat{I}_\varphi^{\,emp}$.
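Example 2.8. Note that when dealing with semiparametric copula models

$$h_\theta(x,y)=c_\theta\left(F_1(x),F_2(y)\right),$$

with unknown nonparametric cumulative distribution functions $F_1$ and $F_2$, it is necessary to estimate them, using for example their empirical counterparts. Denote by $\widehat{F}_1(\cdot)$ and $\widehat{F}_2(\cdot)$ the empirical cumulative distribution functions associated, respectively, to the samples $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_n$, i.e.,

$$\widehat{F}_1(x) := \frac{1}{n}\sum_{i=1}^n\mathbb{1}_{]-\infty,x]}(X_i) \quad\text{and}\quad \widehat{F}_2(y) := \frac{1}{n}\sum_{i=1}^n\mathbb{1}_{]-\infty,y]}(Y_i).$$

Then $\widehat{I}_\varphi$ and $\widehat{\theta}_\varphi$ become

$$\widehat{I}_\varphi=\sup_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n\varphi'\left(c_\theta\left(\widehat{F}_1(X_i),\widehat{F}_2(Y_i)\right)\right)-\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\varphi^*\left(\varphi'\left(c_\theta\left(\widehat{F}_1(X_i),\widehat{F}_2(Y_j)\right)\right)\right)\right\},$$

$$\widehat{\theta}_\varphi=\arg\sup_{\theta\in\Theta}\left\{\frac{1}{n}\sum_{i=1}^n\varphi'\left(c_\theta\left(\widehat{F}_1(X_i),\widehat{F}_2(Y_i)\right)\right)-\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\varphi^*\left(\varphi'\left(c_\theta\left(\widehat{F}_1(X_i),\widehat{F}_2(Y_j)\right)\right)\right)\right\}.$$

Note that $n\widehat{F}_1(X_i)$ is the rank of $X_i$ in the sample $X_1,\ldots,X_n$, and $n\widehat{F}_2(Y_j)$ is the rank of $Y_j$ in the sample $Y_1,\ldots,Y_n$. For some copula models, the copula density $c_\theta(u_1,u_2)$ may be unbounded when either $u_1$ or $u_2$ tends to 1; see e.g. Genest et al. (1995). In this case, to avoid this difficulty, the "rescaled" empirical cumulative distribution functions

$$\widetilde{F}_1(\cdot) := \frac{n}{n+1}\widehat{F}_1(\cdot),\qquad\widetilde{F}_2(\cdot) := \frac{n}{n+1}\widehat{F}_2(\cdot)$$

should be used instead of the standard ones $\widehat{F}_1(\cdot)$ and $\widehat{F}_2(\cdot)$.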
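The rescaled empirical distribution functions evaluated at the sample points are simply ranks divided by $n+1$. A minimal Python sketch, assuming a sample without ties (as is almost surely the case for continuous margins) and using the hypothetical helper name `pseudo_observations`:

```python
def pseudo_observations(sample):
    """Return n/(n+1) * F_hat(X_i) for each X_i, i.e. rank(X_i) / (n + 1).

    The rank of X_i is the number of sample points less than or equal to X_i.
    This O(n^2) version favours clarity; sorting once would give O(n log n).
    """
    n = len(sample)
    return [sum(1 for v in sample if v <= x) / (n + 1) for x in sample]

u = pseudo_observations([3.0, 1.0, 2.0])   # ranks 3, 1, 2 divided by 4
```

The resulting values lie strictly between 0 and 1, so a copula density that blows up at the boundary of the unit square is never evaluated at 1.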


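2.4. A model selection procedure for the ratio $dP/dP^\perp$ through the $\varphi$-MI criterion. Let $\mathcal{M}_{\Theta_1} := \{h_{\theta_1,1}(\cdot,\cdot);\ \theta_1\in\Theta_1\subset\mathbb{R}^{1+d_1}\},\ldots,\mathcal{M}_{\Theta_L} := \{h_{\theta_L,L}(\cdot,\cdot);\ \theta_L\in\Theta_L\subset\mathbb{R}^{1+d_L}\}$ be $L$ candidate models for the ratio $dP/dP^\perp$. For any model $\mathcal{M}_{\Theta_\ell}$, denote by $\widehat{\theta}_\ell$ the estimate of $\theta_T$ given by

$$\widehat{\theta}_\ell := \arg\sup_{\theta_\ell\in\Theta_\ell}M_n(\theta_\ell).$$

The corresponding "expected" criterion is

$$M(\widehat{\theta}_\ell)=\int_{\mathcal{X}\times\mathcal{Y}}f_{\widehat{\theta}_\ell}(x,y)\,dP(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}g_{\widehat{\theta}_\ell}(x,y)\,dP^\perp(x,y).$$

From the representation (23), we can see that the larger the expected criterion $M(\widehat{\theta}_\ell)$ of the model is, the closer the model is to the true one. We then propose the following $k$-fold cross-validation procedure for model selection using the proposed estimate (24) of $\varphi$-MI.

(1) Partition the sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ into $k$ subsamples of equal size $n_k$. (Denote the $i$-th subsample by $(X_{(i-1)n_k+1},Y_{(i-1)n_k+1}),\ldots,(X_{in_k},Y_{in_k})$, for all $i=1,\ldots,k$);

(2) Consider a candidate model $\mathcal{M}_{\Theta_\ell}$;

(3) From the sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ remove the $i$-th subsample; compute the estimate $\widehat{\theta}_\ell^{(-i)}$ given by (25) using the remaining $n-n_k$ observations, i.e.,

$$\widehat{\theta}_\ell^{(-i)}=\arg\sup_{\theta_\ell\in\Theta_\ell}M_{n-n_k}^{(-i)}(\theta_\ell);$$

(4) Repeat steps (2) and (3) for all $i=1,\ldots,k$, and obtain the following "estimate"

$$CV(\mathcal{M}_{\Theta_\ell}) := \frac{1}{k}\sum_{i=1}^k\left(\frac{1}{n_k}\sum_{j=(i-1)n_k+1}^{in_k}f_{\widehat{\theta}_\ell^{(-i)}}(X_j,Y_j)-\frac{1}{n_k^2}\sum_{j,m=(i-1)n_k+1}^{in_k}g_{\widehat{\theta}_\ell^{(-i)}}(X_j,Y_m)\right)$$

of the expected criterion $M(\widehat{\theta}_\ell)$, i.e.,

$$M(\widehat{\theta}_\ell)=\int_{\mathcal{X}\times\mathcal{Y}}f_{\widehat{\theta}_\ell}(x,y)\,dP(x,y)-\int_{\mathcal{X}\times\mathcal{Y}}g_{\widehat{\theta}_\ell}(x,y)\,dP^\perp(x,y);$$

(5) Repeat steps (2-4) for all $\ell=1,\ldots,L$, and select the "optimal" model that maximizes $CV(\mathcal{M}_{\Theta_\ell})$ over $\ell=1,\ldots,L$, i.e., the model $\mathcal{M}_{\Theta_{\ell^*}}$ with

$$\ell^* := \arg\sup_{\ell\in\{1,\ldots,L\}}CV(\mathcal{M}_{\Theta_\ell}).$$

Other model selection-type procedures can be investigated, e.g. by correcting the bias of $M_n(\widehat{\theta}_\ell)$ as an estimate of the expected criterion $M(\widehat{\theta}_\ell)$ and selecting the model that maximizes the resulting bias-corrected information criterion. The correction can be made e.g. by asymptotic evaluation of the bias as in the classical AIC criterion, or using the bootstrap; see e.g. Konishi and Kitagawa (2008) and Shao and Tu (1995).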
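The five steps above can be sketched as follows for the KL criterion ($f_\theta=\log h_\theta$, $g_\theta=h_\theta-1$). The sketch makes simplifying assumptions that are not in the paper: each candidate model is one-parameter, $h_\theta(x,y)=\exp(\alpha+\beta\,m(x,y))$ for a user-supplied function $m$; $\alpha$ is profiled out in closed form; the fit uses a damped Newton iteration; and the names `fit_model`, `held_out_criterion` and `cross_validate` are hypothetical.

```python
import math
import random

def fit_model(train, m, max_iter=40, tol=1e-8):
    """Dual KL fit of h = exp(alpha + beta*m) on `train`; returns (alpha_hat, beta_hat)."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    pair_m = [m(x, y) for x in xs for y in ys]          # m over all pairs (X_i, Y_j)
    m_bar = sum(m(x, y) for x, y in train) / len(train)

    def moments(beta):
        s0 = s1 = s2 = 0.0
        for v in pair_m:
            w = math.exp(beta * v)
            s0 += w
            s1 += w * v
            s2 += w * v * v
        mean = s1 / s0
        return math.log(s0 / len(pair_m)), mean, s2 / s0 - mean * mean

    beta, value = 0.0, 0.0
    log_a, mean, var = moments(beta)
    for _ in range(max_iter):
        step = max(-1.0, min(1.0, (m_bar - mean) / var))   # capped Newton step
        if abs(step) < tol:
            break
        while True:                                        # backtracking
            cand = beta + step
            c_log_a, c_mean, c_var = moments(cand)
            c_value = cand * m_bar - c_log_a
            if c_value >= value - 1e-12 or abs(step) < tol:
                break
            step /= 2
        beta, value, log_a, mean, var = cand, c_value, c_log_a, c_mean, c_var
    return -log_a, beta

def held_out_criterion(test, m, alpha, beta):
    """Empirical criterion on the held-out block, as in step (4)."""
    nk = len(test)
    first = sum(alpha + beta * m(x, y) for x, y in test) / nk
    second = sum(math.exp(alpha + beta * m(x, yy)) - 1.0
                 for x, _ in test for _, yy in test) / nk ** 2
    return first - second

def cross_validate(sample, models, k):
    """Return the CV score of each candidate model (larger is better)."""
    nk = len(sample) // k
    scores = []
    for m in models:
        total = 0.0
        for i in range(k):
            test = sample[i * nk:(i + 1) * nk]
            train = sample[:i * nk] + sample[(i + 1) * nk:]
            alpha, beta = fit_model(train, m)
            total += held_out_criterion(test, m, alpha, beta)
        scores.append(total / k)
    return scores

rng = random.Random(0)
rho = 0.6
sample = []
for _ in range(240):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    sample.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
models = [lambda x, y: x * y, lambda x, y: math.sin(x) * math.cos(y)]
scores = cross_validate(sample, models, k=4)
best = max(range(len(models)), key=lambda l: scores[l])
```

In this simulated Gaussian example the first candidate, which contains the cross term $xy$, is expected to obtain the larger score, since the second candidate carries essentially no information about the dependence.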


3. Asymptotic properties of the estimates

We state in Section 3.1 the consistency of both estimates $\widehat{I}_\varphi$ and $\widehat{\theta}_\varphi$ of the $\varphi$-MI and the parameter $\theta_T$. Section 3.2 gives, under the null hypothesis of independence, the limiting distribution of the estimate $\widehat{I}_{KL}$ of the KL-MI, as well as that of the corresponding estimate $\widehat{\theta}_{KL}$ of the parameter $\theta_T$, for some specific forms of the model $\{h_\theta(\cdot,\cdot);\ \theta\in\Theta\}$. Section 3.3 provides bootstrap calibration of the critical value of any $\widehat{I}_\varphi$-based test statistic for general forms of the model $\{h_\theta(\cdot,\cdot);\ \theta\in\Theta\}$.
3.1. Consistency. In this section, we state the consistency of the estimate $\widehat{I}_\varphi$ of the $\varphi$-MI, defined by (17), as well as the consistency of the estimate $\widehat{\theta}_\varphi$ of $\theta_T$. We will use classical techniques from M-estimation theory. We will make use of the following conditions.

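(A.5) The parameter space $\Theta$ is a compact subset of $\mathbb{R}\times\mathbb{R}^d$;

(A.6) $\int_{\mathcal{X}\times\mathcal{Y}}\sup_{\theta\in\Theta}|f_\theta(x,y)|\,dP(x,y)<\infty$;

(A.7) $\int_{\mathcal{X}\times\mathcal{Y}}\sup_{\theta\in\Theta}|g_\theta(x,y)|\,dP^\perp(x,y)<\infty$,

where $f_\theta$ and $g_\theta$ are defined respectively by (19) and (20). Note that assumptions (A.6-7) imply (A.3-4).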

Proposition 3.1. Assume that conditions (A.1, 5-7) hold. Then, the estimate $\widehat{I}_\varphi$ of $I_\varphi(P)$ defined by (17) and the estimate $\widehat{\theta}_\varphi$ of $\theta_T$ defined by (18) are consistent. Precisely, as $n\to\infty$, the following convergences in probability hold

$$\widehat{I}_\varphi\to I_\varphi(P) \quad\text{and}\quad \widehat{\theta}_\varphi\to\theta_T.$$

Remark 3.2. Since in practice all models are generally "misspecified", the true parameter value $\theta_T$ may not exist. It can however be replaced by the "pseudo-true" value $\theta_T^* := \arg\sup_{\theta\in\Theta}M(\theta)$, and the consistency results in the above proposition remain valid.

3.2. The limiting distribution of the estimate $\widehat{I}_{KL}$ of KL-MI. We now give the limiting distribution of the particular test statistic based on the estimate $\widehat{I}_{KL}$ of the classical KL-MI, for specific forms of the model $h_\theta(\cdot,\cdot)$, under the null hypothesis of independence $\mathcal{H}_0:\ P=P^\perp$. Consider the following specific form of the model $h_\theta(\cdot,\cdot)$

$$h_\theta(x,y)=\exp\left(\alpha+m_\beta(x,y)\right) \quad\text{with}\quad m_\beta(x,y) := \sum_{k=1}^d\beta_k\xi_k(x)\zeta_k(y), \qquad (28)$$

for some specified measurable real-valued functions $\xi_k$ and $\zeta_k$, $k=1,\ldots,d$, defined, respectively, on $\mathcal{X}$ and $\mathcal{Y}$. The parameter $\theta$ is the vector $\theta := (\alpha,\beta_1,\ldots,\beta_d)^\top\in\Theta\subset\mathbb{R}\times\mathbb{R}^d$. In this case, the functions (19) and (20) become

$$f_\theta(x,y)=\alpha+\sum_{k=1}^d\beta_k\xi_k(x)\zeta_k(y)$$

and

$$g_\theta(x,y)=\exp\left(\alpha+\sum_{k=1}^d\beta_k\xi_k(x)\zeta_k(y)\right)-1.$$

The value $\theta_0$, corresponding to independence, is here $\theta_0=\mathbf{0} := (0,\ldots,0)^\top\in\mathbb{R}^{1+d}$. We will give the limiting distributions of $\widehat{\theta}_{KL}$ and $\widehat{I}_{KL}$ under the null hypothesis of independence $P=P^\perp$, i.e., when $\theta_T=\theta_0=\mathbf{0}$. We will consider the following assumptions.
(A.8) There exists a neighborhood $N(\theta_T)$ of $\theta_T$ such that the third order partial derivative functions $\{(x,y)\mapsto(\partial^3/\partial\theta^3)f_\theta(x,y);\ \theta\in N(\theta_T)\}$ (resp. $\{(x,y)\mapsto(\partial^3/\partial\theta^3)g_\theta(x,y);\ \theta\in N(\theta_T)\}$) are dominated by some $P$-integrable function (resp. some $P^\perp$-square-integrable function);

(A.9) The integrals $P\|f'_{\theta_T}\|^2$, $P^\perp\|g'_{\theta_T}\|^2$, $P\|f''_{\theta_T}\|$ and $P^\perp\|g''_{\theta_T}\|$ are finite, and the matrix

$$\Sigma_1 := -\left(Pf''_{\theta_T}-P^\perp g''_{\theta_T}\right) \qquad (29)$$

is nonsingular.

Theorem 3.3. Assume that conditions (A.1-2, 5-9) hold and that $P = P^\perp$ (i.e., $\theta_T = 0$). Then,

(a) $\sqrt{n}\,\widehat{\theta}_{\varphi_1}$ converges in distribution to a centered multivariate normal random variable with covariance matrix $\Sigma = \Sigma_1^{-1}\Sigma_2\Sigma_1^{-1}$, where $\Sigma_1$ and $\Sigma_2$ are given respectively by (29) and (40);

(b) $2n\,\widehat{I}_{\varphi_1}$ converges in distribution to the random variable $Z^\top Z$, where $Z$ is a centered multivariate normal random variable with covariance matrix

$$C = \Sigma_1^{-1/2}\,\Sigma_2\,\Sigma_1^{-1/2}.$$

Remark 3.4. For the finite-discrete case, using the modeling (13) in Example 2.3, we can see that the corresponding matrix $\Sigma_2$ is of rank $(K_1 - 1)(K_2 - 1)$ and that the limiting distribution of $2n\,\widehat{I}_{\varphi_1}$ is a $\chi^2$-distribution with $(K_1 - 1)(K_2 - 1)$ degrees of freedom. In particular, we recover the classical $\chi^2$-independence test theorem (for the case of finite-discrete distributions).
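To make the estimate studied in this subsection concrete, the following sketch computes a dual KL-MI estimate for the simplest instance of model (28). It is only an illustration under stated assumptions: we take $d = 1$, $\xi_1(x) = x$, $\zeta_1(y) = y$, and we assume that the dual criterion is $M_n(\theta) = \widehat{P} f_\theta - \widehat{P}^\perp g_\theta$ (the form used in the proofs of the Appendix), where $\widehat{P}^\perp$ averages over all $n^2$ pairs $(X_i, Y_j)$. Maximizing over $\alpha$ in closed form leaves a concave one-dimensional problem in $\beta$, solved here by a damped Newton iteration. The function name `dual_kl_mi` is hypothetical.

```python
import math


def dual_kl_mi(xs, ys, max_iter=100, tol=1e-10):
    """Dual KL-MI estimate for h(x, y) = exp(alpha + beta * x * y).

    After profiling out alpha, the criterion reduces to
        beta * mean(x_i * y_i) - log(mean over (i, j) of exp(beta * x_i * y_j)).
    Assumes moderately scaled data so that exp() does not overflow.
    Returns (estimate, beta_hat).
    """
    n = len(xs)
    prods = [x * y for x in xs for y in ys]         # all n^2 products x_i * y_j
    m = sum(x * y for x, y in zip(xs, ys)) / n      # mean of paired products

    def objective(beta):
        s = sum(math.exp(beta * p) for p in prods) / len(prods)
        return beta * m - math.log(s)

    beta = 0.0
    for _ in range(max_iter):
        weights = [math.exp(beta * p) for p in prods]
        total = sum(weights)
        mean = sum(p * w for p, w in zip(prods, weights)) / total
        var = sum(p * p * w for p, w in zip(prods, weights)) / total - mean ** 2
        if var <= 1e-15:
            break
        # Newton step for a concave objective, clipped and halved until it improves.
        step = max(-10.0, min(10.0, (m - mean) / var))
        current = objective(beta)
        while abs(step) > tol and objective(beta + step) < current:
            step /= 2.0
        if abs(step) <= tol:
            break
        beta += step
    return objective(beta), beta
```

Since the iteration starts at $\beta = 0$, where the criterion equals zero, and only accepts improving steps, the returned estimate is nonnegative by construction. The $O(n^2)$ cost per iteration is acceptable for the sample sizes considered in Section 5.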

3.3. Bootstrap calibration. In the general context of model (9), for a given $\varphi$, we propose the following bootstrap procedure to calibrate the critical value of the corresponding test statistic. The critical value, denoted $b_\alpha$, is the upper $\alpha$-quantile of the distribution of the test statistic $S_n := 2n\,\widehat{I}_\varphi$ under the null hypothesis $\mathcal{H}_0$ of independence.

(1) Generate a bootstrap sample $(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)$ from the product empirical distribution $\widehat{P}^\perp = \widehat{P}_1 \otimes \widehat{P}_2$ of the original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$;

(2) Compute the value of the statistic $S_n^* := 2n\,\widehat{I}_\varphi^*$ from the bootstrap sample;

(3) Repeat steps (1) and (2) $B = 1000$ times, independently, to obtain the realizations $S_{n,1}^*, S_{n,2}^*, \ldots, S_{n,B}^*$;

(4) Estimate $b_\alpha$ by $\widehat{b}_\alpha$ := the $(1 - \alpha)$th quantile of the sequence $\{S_{n,1}^*, S_{n,2}^*, \ldots, S_{n,B}^*\}$.
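Steps (1) to (4) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `bootstrap_critical_value` and `kl_mi_statistic` are hypothetical, and the statistic plugged in here is $2n$ times the plug-in KL mutual information of a discrete sample, used only as a stand-in for $2n\,\widehat{I}_\varphi$. Sampling from the product empirical distribution amounts to resampling the $X$ and $Y$ coordinates independently of each other.

```python
import math
import random
from collections import Counter


def kl_mi_statistic(xs, ys):
    """2n times the plug-in KL mutual information of a discrete sample."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = sum(c / n * math.log(c * n / (px[x] * py[y]))
             for (x, y), c in joint.items())
    return 2 * n * mi


def bootstrap_critical_value(xs, ys, statistic, alpha=0.05, n_boot=1000, seed=0):
    """Estimate the upper alpha-quantile of `statistic` under independence."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Step (1): draw X* and Y* independently from their empirical marginals.
        xb = rng.choices(xs, k=n)
        yb = rng.choices(ys, k=n)
        # Step (2): evaluate the statistic on the bootstrap sample.
        stats.append(statistic(xb, yb))
    # Step (4): empirical (1 - alpha) quantile of the bootstrap statistics.
    stats.sort()
    index = min(n_boot - 1, max(0, math.ceil((1 - alpha) * n_boot) - 1))
    return stats[index]
```

The test then rejects independence when the statistic computed on the original sample exceeds the returned value. Any other statistic taking two sequences and returning a number can be passed as `statistic`.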

4. Large deviations principle and Bahadur asymptotic efficiency 

In this section, we compare the Bahadur asymptotic efficiency of $\varphi$-MI based independence tests and show that the test based on the classical Kullback-Leibler mutual information is the most efficient. Given two sequences of statistics $(\widehat{I}_{\varphi_1})_n$ and $(\widehat{I}_{\varphi_2})_n$ for the test problem (5), numbers $\alpha \in (0,1)$, $\gamma \in (0,1)$ and an alternative hypothesis $P \neq P^\perp$, we define $n_i(\alpha, \gamma, P)$, for $i \in \{1, 2\}$, as the minimal number of observations needed for the test based on $\widehat{I}_{\varphi_i}$ to have significance level $\alpha$ and power level $\gamma$. Then, the Bahadur asymptotic relative efficiency of $(\widehat{I}_{\varphi_1})_n$ with respect to $(\widehat{I}_{\varphi_2})_n$ is defined as (if the limit exists)

$$\lim_{\alpha \to 0} \frac{n_2(\alpha, \gamma, P)}{n_1(\alpha, \gamma, P)}.$$


It is well known, see for example Nikitin (1995) and van der Vaart (1998) Chapter 14, that if both sequences $(\widehat{I}_{\varphi_1})_n$ and $(\widehat{I}_{\varphi_2})_n$ satisfy a large deviation principle under the null hypothesis (with good rate functions $e_{\varphi_1}(\cdot)$ and $e_{\varphi_2}(\cdot)$) and also a law of large numbers under a given alternative hypothesis $\mathcal{H}_1 : P \neq P^\perp$, with asymptotic means $\mu_{\varphi_1}(P)$ and $\mu_{\varphi_2}(P)$ respectively, then the Bahadur asymptotic relative efficiency equals $e_{\varphi_1}(\mu_{\varphi_1}(P)) / e_{\varphi_2}(\mu_{\varphi_2}(P))$. In particular, the most efficient test maximizes the Bahadur slope $e_\varphi(\mu_\varphi(P))$. A law of large numbers under the alternative hypothesis is given for the sequence $(\widehat{I}_\varphi)_n$ in Proposition 3.1 above, the expected value being $\mu_\varphi(P) = I_\varphi(P) = D_\varphi(P, P^\perp)$. The following theorem establishes a large deviation principle under the null hypothesis of independence. It relies on a generalization, due to Eichelsbacher and Schmock (2002), of the classical Sanov theorem to finer topologies, and on the contraction principle. Let $\mathcal{G}$ be the set of measurable functions, from $\mathcal{X} \times \mathcal{Y}$ into $\mathbb{R}$, given by

$$\mathcal{G} := \mathcal{B} \cup \{\varphi'(h_\theta);\ \theta \in \Theta\} \cup \{\varphi^*(\varphi'(h_\theta));\ \theta \in \Theta\},$$


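where $\mathcal{B}$ is the set of all measurable bounded functions from $\mathcal{X} \times \mathcal{Y}$ into $\mathbb{R}$. Recall that $\mathcal{M}_1 = \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ is the set of all probability measures on $\mathcal{X} \times \mathcal{Y}$, and let us introduce the subset

$$\mathcal{M}_\mathcal{G} := \mathcal{M}_\mathcal{G}(\mathcal{X} \times \mathcal{Y}) := \Big\{ P \in \mathcal{M}_1 :\ \int_{\mathcal{X}\times\mathcal{Y}} |\varphi'(h_\theta)|\,dP < \infty,\ \int_{\mathcal{X}\times\mathcal{Y}} |\varphi^*(\varphi'(h_\theta))|\,dP^\perp < \infty,\ \forall \theta \in \Theta \Big\}.$$

Define on $\mathcal{M}_\mathcal{G}$ the $\tau_\mathcal{G}$-topology as the coarsest one that makes the applications $P \in \mathcal{M}_\mathcal{G} \mapsto \int_{\mathcal{X}\times\mathcal{Y}} \varphi'(h_\theta)\,dP$, $P \in \mathcal{M}_\mathcal{G} \mapsto \int_{\mathcal{X}\times\mathcal{Y}} f\,dP$, $P \in \mathcal{M}_\mathcal{G} \mapsto \int_{\mathcal{X}\times\mathcal{Y}} \varphi^*(\varphi'(h_\theta))\,dP^\perp$ and $P \in \mathcal{M}_\mathcal{G} \mapsto \int_{\mathcal{X}\times\mathcal{Y}} f\,dP^\perp$ continuous, for all $\theta \in \Theta$ and all $f \in \mathcal{B}$. Finally, define, for all $Q \in \mathcal{M}_\mathcal{G}$, the “pseudo-divergence”

$$\mathcal{D}_\varphi(Q, Q^\perp) := \sup_{\theta \in \Theta} \Big\{ \int_{\mathcal{X}\times\mathcal{Y}} \varphi'(h_\theta(x,y))\,dQ(x,y) - \int_{\mathcal{X}\times\mathcal{Y}} \varphi^*\big(\varphi'(h_\theta(x,y))\big)\,dQ^\perp(x,y) \Big\}.$$

Obviously, $\mathcal{D}_\varphi(Q, Q^\perp) \le D_\varphi(Q, Q^\perp) =: I_\varphi(Q)$, with equality for probability distributions such that $dQ/dQ^\perp = h_\theta$ for some $\theta \in \Theta$. Note also that $Q \in \mathcal{M}_\mathcal{G} \mapsto \mathcal{D}_\varphi(Q, Q^\perp)$ is continuous with respect to the $\tau_\mathcal{G}$-topology, as the supremum over the compact set $\Theta$ of continuous functions.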
The large deviation principle for the sequence $(\widehat{P})_n$ of empirical measures defined by (8), established by Eichelsbacher and Schmock (2002), requires the existence of exponential moments; in the context of the model (9), we thus assume

(A.10) for all $f \in \mathcal{G}$, for all $a > 0$,

$$\int_{\mathcal{X}\times\mathcal{Y}} \exp(a|f|)\,dP < \infty.$$

Note that the strong assumption (A.10) implies (A.3-4) if $P = P^\perp$. In the context of the models described in Examples 2.1 to 2.5, assumption (A.10) may not be satisfied for some $\varphi$-divergences; in particular, it does not generally hold for power-divergences (except for the finite-discrete distribution models described in Example 2.3). A sufficient condition for (A.10) is

(A.11) there exist real numbers $m, M$ such that $m < h_\theta(x,y) < M$, $\forall (x,y) \in \mathcal{X} \times \mathcal{Y}$, $\forall \theta \in \Theta$.

Indeed, for all $a > 0$, the functions $\exp(a|\varphi'(h_\theta)|)$ and $\exp(a|\varphi^*(\varphi'(h_\theta))|)$ are then bounded and therefore integrable with respect to both $P$ and $P^\perp$. Again, (A.11) is not generally satisfied for the models described in the previous examples for power-divergences, but it may be artificially enforced by truncating the distributions in the models. Let us also point out that Theorems 4.1 and 4.2 below may remain true under alternative assumptions on the distribution tails, weaker than (A.10). In particular, simulations performed in Section 5 for bivariate Gaussian distributions tend to show that Theorem 4.2 holds for the Gaussian model described in Example 2.1. To get a closed form for the LDP of $(\widehat{I}_\varphi)_n$, we will establish the right-continuity of the rate function, making use of one of the following assumptions:

(A.12.a) $(X,Y)$ is finite-discrete, supported by $\mathcal{X} \times \mathcal{Y}$;

(A.12.b) The model $\{h_\theta(\cdot,\cdot);\ \theta = (\alpha, \beta^\top)^\top \in \Theta\}$ is of the form $h_\theta(x,y) = \exp(\alpha + m_\beta(x,y))$ with the condition that, for any constant $c$ and any $\beta$, we have $P^\perp(m_\beta(X,Y) = c) \neq 0$ iff $\beta = (0,\ldots,0)^\top$ and $c = 0$.

Theorem 4.1. Let $(X,Y)$ be a couple of independent random variables with joint distribution $P = P^\perp \in \mathcal{M}_\Theta \cap \mathcal{M}_\mathcal{G}$.

(1) Suppose that conditions (A.1-2, 5-7, 10 and 12.b) are satisfied. Then, the sequence $(\widehat{I}_\varphi)_n$ of estimates of $I_\varphi(P) = 0$, given by (17), satisfies the following large deviation principle

$$\lim_{n \to \infty} \frac{1}{n} \log P^\perp\big(\widehat{I}_\varphi > d\big) = -e_\varphi(d), \quad d > 0, \tag{30}$$

where the good rate function $e_\varphi(\cdot)$ is

$$e_\varphi(d) := \inf_{Q \in \Omega_d} K(Q, P^\perp) \quad\text{with}\quad \Omega_d := \big\{ Q \in \mathcal{M}_\mathcal{G} \text{ such that } \mathcal{D}_\varphi(Q, Q^\perp) \ge d \big\}. \tag{31}$$

(2) Assume that conditions (A.1-2, 5 and 12.a) are satisfied. Then the above statement holds if $\mathcal{M}_\mathcal{G}$ is replaced by the set of all finite-discrete distributions with the same finite support $\mathcal{X} \times \mathcal{Y}$.

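In view of Proposition 3.1 and Theorem 4.1 above, the Bahadur slope of the independence test based on $\widehat{I}_\varphi$, for any $\varphi$, is then given by

$$S_\varphi := e_\varphi(I_\varphi(P)) = \inf\{ K(Q, P^\perp) :\ \mathcal{D}_\varphi(Q, Q^\perp) \ge D_\varphi(P, P^\perp) \}.$$

Since $\mathcal{D}_\varphi(P, P^\perp) = D_\varphi(P, P^\perp)$, we have $P \in \{Q : \mathcal{D}_\varphi(Q, Q^\perp) \ge D_\varphi(P, P^\perp)\}$, so that, for any $\varphi$,

$$S_\varphi \le K(P, P^\perp) = I_{KL}(P) = I_{\varphi_1}(P). \tag{32}$$

Equality is achieved in (32) for the divergence $D_\varphi = K$. Indeed,

$$S_{KL} = \inf\{ K(Q, P^\perp) :\ \mathcal{D}_{KL}(Q, Q^\perp) \ge K(P, P^\perp) \}.$$

Straightforward computations yield

$$K(Q, P^\perp) = K(Q, Q^\perp) + K(Q_1, P_1) + K(Q_2, P_2),$$

for any $Q \in \mathcal{M}_\mathcal{G}$. In particular, for any $Q$ such that $\mathcal{D}_{KL}(Q, Q^\perp) \ge K(P, P^\perp)$, we have $K(Q, Q^\perp) \ge \mathcal{D}_{KL}(Q, Q^\perp) \ge K(P, P^\perp)$, hence,

$$K(Q, P^\perp) \ge K(Q, Q^\perp) \ge K(P, P^\perp),$$

so that

$$S_{KL} \ge K(P, P^\perp). \tag{33}$$

Combining (32) and (33), we obtain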

Theorem 4.2. Let $(X, Y)$ be a couple of random variables with joint distribution $P \in \mathcal{M}_\Theta \cap \mathcal{M}_\mathcal{G}$. Suppose that either conditions (A.1-2, 5-7, 10 and 12.b) or (A.1-2, 5 and 12.a) are satisfied. For the test problem (5), the test based on the estimate $\widehat{I}_{\varphi_1}$, see (17), of the Kullback-Leibler mutual information is uniformly (i.e., whatever the alternative $P \neq P^\perp$) the most efficient test, in the Bahadur sense, among all $\widehat{I}_\varphi$-based tests, including the classical $\chi^2$-independence one.

Remark 4.3. Assume that $P$ is a finite-discrete distribution. We then obtain that the KL-MI based independence test is more efficient than the classical $\chi^2$-independence one. This result was already stated, in goodness-of-fit testing for finite-discrete distributions; see e.g. van der Vaart (1998), Chapter 17, Section 17.6. The above theorem extends it to testing independence, for more general probability distributions, not necessarily finite-discrete.
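The decomposition $K(Q, P^\perp) = K(Q, Q^\perp) + K(Q_1, P_1) + K(Q_2, P_2)$ used to derive (33) can be checked numerically on finite-discrete distributions. The sketch below is only an illustration; the helper names (`kl`, `margins`, `product`) and the numerical values of $Q$, $P_1$ and $P_2$ are arbitrary choices made for the example.

```python
import math


def kl(q, p):
    """Kullback-Leibler divergence between two discrete distributions (flat lists)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def margins(joint):
    """Row and column marginals of a joint distribution given as a nested list."""
    rows = [sum(r) for r in joint]
    cols = [sum(r[j] for r in joint) for j in range(len(joint[0]))]
    return rows, cols


def product(p1, p2):
    """Product distribution p1 (x) p2, flattened row by row."""
    return [a * b for a in p1 for b in p2]


def flatten(joint):
    return [v for row in joint for v in row]


# Arbitrary joint distribution Q on a 2 x 3 support, and arbitrary marginals P1, P2.
Q = [[0.10, 0.25, 0.05],
     [0.30, 0.10, 0.20]]
P1 = [0.5, 0.5]
P2 = [0.2, 0.3, 0.5]

Q1, Q2 = margins(Q)
lhs = kl(flatten(Q), product(P1, P2))                      # K(Q, P^perp)
rhs = kl(flatten(Q), product(Q1, Q2)) + kl(Q1, P1) + kl(Q2, P2)
```

Both `lhs` and `rhs` agree up to floating-point error, and each of the three terms in `rhs` is nonnegative, which is exactly what gives $K(Q, P^\perp) \ge K(Q, Q^\perp)$ in the argument above.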
5. Simulations

This section aims at numerically comparing, through simulations, $\varphi$-MI based tests with other independence or noncorrelation tests. Precisely, Section 5.1 focuses on finite-discrete random vectors, for which the optimal KL-MI test is compared to the very popular (but not optimal) $\chi^2$-independence test. Section 5.2 compares KL-MI and $\chi^2$-MI tests to the classical noncorrelation tests of Pearson, Kendall and Spearman. Finally, Section 5.3 deals with the example of the copula density model of Farlie-Gumbel-Morgenstern (FGM), for which the critical values of KL-MI and $\chi^2$-MI tests are derived through the bootstrap procedure described in Section 3.3.

5.1. Testing independence of finite-discrete random variables. As stated in Example 2.6, the dual estimates given by (17) equal the direct empirical ones (26). Their properties and asymptotic behavior are well known; see e.g. Pardo (2006). They are recovered by Proposition 3.1, Theorem 3.3 and Theorem 4.2. We illustrate these properties through simulations, by comparing the power of KL-MI and $\chi^2$-MI tests, for various sample sizes and finite-discrete supports $\mathcal{X} = \mathcal{Y} = \{1,\ldots,K\}$, and for alternatives $P_\theta \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ of the form $P_\theta := (p_{x,y,\theta})_{(x,y)}$, with

$$p_{x,y,\theta} = \frac{1-\theta}{K^2} + \frac{\theta}{K}\,\mathbb{1}_{\{x = y\}}, \quad (x,y) \in \mathcal{X} \times \mathcal{Y}, \tag{34}$$

where $K = |\mathcal{X}| = |\mathcal{Y}|$ and $\theta \in (0,1)$, i.e., the random variables $X$ and $Y$ are uniformly distributed on the set $\{1,\ldots,K\}$, and the conditional distribution $P_{Y|X=x}(\cdot)$ of $Y$ given $X = x$ is the mixture of the uniform distribution on $\{1,\ldots,K\}$ with weight $(1-\theta)$ and the Dirac measure $\delta_x(\cdot)$ with weight $\theta$, for all $x \in \{1,\ldots,K\}$. Hence, for $\theta = \theta_0 = 0$, $X$ and $Y$ are independent, while for $\theta = 1$, we have $Y = X$. The level of the tests has been set to $\alpha = 0.01$. The asymptotic distribution of $2n\,\widehat{I}_\varphi$ is $\chi^2((K-1)(K-1))$, a $\chi^2$-distribution with $(K-1)^2$ degrees of freedom, for both KL-MI and $\chi^2$-MI. The critical value $b_{0.01}$ of both test statistics is then taken to be the upper 0.01-quantile of the $\chi^2((K-1)(K-1))$-distribution. We have then estimated their respective powers by a Monte-Carlo procedure from 10000 samples drawn according to $P_\theta$ given by (34), for various mixture parameter values $\theta \in (0,1)$. The results are presented in Table 2, Figure 1 and Figure 2. We can see that the KL-MI test outperforms the classical $\chi^2$ one. The empirical levels of both KL-MI and $\chi^2$-MI tests (their rejection rates at $\theta = 0$) are close to the test level $\alpha = 0.01$.
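The Monte-Carlo experiment described above can be sketched as follows for $K = 2$. This is an illustrative reimplementation under stated assumptions, not the code used for Table 2: the function names are hypothetical, $2n$ times the $\chi^2$-MI estimate is taken to be Pearson's statistic (which assumes the normalization $\varphi(x) = (x-1)^2/2$), and the $\chi^2(1)$ critical value is obtained as the square of a standard normal quantile, which is valid only for one degree of freedom.

```python
import math
import random
from statistics import NormalDist


def sample_pair(K, theta, rng):
    """Draw (X, Y) from model (34): X uniform, Y = X w.p. theta, else uniform."""
    x = rng.randrange(1, K + 1)
    y = x if rng.random() < theta else rng.randrange(1, K + 1)
    return x, y


def mi_statistics(sample, K):
    """Return (2n * plug-in KL-MI, Pearson chi-square) for a discrete sample."""
    n = len(sample)
    counts = [[0] * K for _ in range(K)]
    for x, y in sample:
        counts[x - 1][y - 1] += 1
    row = [sum(r) for r in counts]
    col = [sum(counts[i][j] for i in range(K)) for j in range(K)]
    kl_stat = chi2_stat = 0.0
    for i in range(K):
        for j in range(K):
            expected = row[i] * col[j] / n
            if expected == 0:
                continue  # empty row or column: the cell contributes nothing
            observed = counts[i][j]
            chi2_stat += (observed - expected) ** 2 / expected
            if observed > 0:
                kl_stat += 2 * observed * math.log(observed / expected)
    return kl_stat, chi2_stat


def estimate_power(K, theta, n, crit, n_rep=2000, seed=0):
    """Monte-Carlo rejection rates of the KL-MI and chi-square tests."""
    rng = random.Random(seed)
    rej_kl = rej_chi2 = 0
    for _ in range(n_rep):
        sample = [sample_pair(K, theta, rng) for _ in range(n)]
        kl_stat, chi2_stat = mi_statistics(sample, K)
        rej_kl += kl_stat > crit
        rej_chi2 += chi2_stat > crit
    return rej_kl / n_rep, rej_chi2 / n_rep


# Upper 0.01-quantile of chi2(1): square of the 0.995 standard normal quantile.
crit_k2 = NormalDist().inv_cdf(0.995) ** 2
```

For instance, `estimate_power(2, 0.38, 30, crit_k2)` returns the two estimated rejection rates at $\theta = 0.38$ and $n = 30$. For $K > 2$ the critical value has to come from a $\chi^2$ quantile function with $(K-1)^2$ degrees of freedom, which the Python standard library does not provide.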

5.2. Comparison of $\varphi$-MI based and noncorrelation tests in the Gaussian setting.

For bidimensional normally distributed random vectors, the corresponding model, see Example 2.1, is of the form (28), so that the asymptotic distribution of the dual KL-MI based test statistic $2n\,\widehat{I}_{\varphi_1}$ is explicit. Hence, an explicit (asymptotic) critical value can be obtained for

K = 2     $\theta$               0       0.08    0.18    0.28    0.38    0.48    0.58    0.68
n = 30    KL-MI test power       0.0123  0.0242  0.0647  0.1681  0.3343  0.5690  0.7981  0.9415
          $\chi^2$ test power    0.0102  0.0200  0.0550  0.1433  0.2968  0.5330  0.7703  0.9288
n = 40    KL-MI test power       0.0119  0.0213  0.0764  0.2176  0.4502  0.7180  0.9046  0.9850
          $\chi^2$ test power    0.0100  0.0184  0.0694  0.2006  0.4272  0.6970  0.8957  0.9839

K = 3     $\theta$               0       0.07    0.15    0.23    0.31    0.39    0.47    0.55
n = 35    KL-MI test power       0.0192  0.0261  0.0604  0.1503  0.3162  0.5267  0.7476  0.8952
          $\chi^2$ test power    0.0081  0.0118  0.0371  0.1157  0.2708  0.4878  0.7259  0.8895
n = 50    KL-MI test power       0.0152  0.0261  0.0782  0.2152  0.4369  0.7150  0.9039  0.9816
          $\chi^2$ test power    0.0088  0.0167  0.0648  0.1929  0.4283  0.7124  0.9057  0.9832

Table 2. Comparison of powers of KL-MI and $\chi^2$ tests. The number of cells $K$ is indicated at the top left of each block. The sample sizes $n$ are given by the first column while the mixture parameter values $\theta$, see its definition in (34), are given by the first row.



Figure 1. Comparison of KL-MI and $\chi^2$-MI based tests for finite-discrete random variables taking values in {1, 2}, with n = 30.

Figure 2. Comparison of KL-MI and $\chi^2$-MI based tests for finite-discrete random variables taking values in {1, 2, 3}, with n = 35.


the test statistic $2n\,\widehat{I}_{\varphi_1}$. Although assumption (A.10) may not be satisfied without restricting the support of $(X, Y)$ to a bounded subset of $\mathbb{R}^2$, we can compare numerically the powers of the $\varphi$-MI based tests. Precisely, in this section we compare the powers of KL-MI and $\chi^2$-MI independence tests with noncorrelation tests for samples of size $n = 50$ drawn according to bivariate normal distributions. We have fixed the level $\alpha = 0.05$ and computed the critical value of the KL-MI based test by means of Monte-Carlo simulations of the asymptotic distribution of $2n\,\widehat{I}_{\varphi_1}$ given by Theorem 3.3 (10000 samples of the variable $Z$ in Theorem 3.3 have been simulated; the critical value has been obtained as the 0.95-quantile of the linearly interpolated empirical cumulative distribution function). The critical value for the $\chi^2$-MI based test has been estimated directly by simulating 10000 samples of size 50 of a bivariate Gaussian random vector with independent, centered and reduced components and computing the 0.95-quantile of the corresponding test statistic $2n\,\widehat{I}_{\chi^2}$. Then we have estimated the power of these tests, as well as that of the noncorrelation tests of Pearson, Spearman and Kendall, still by Monte-Carlo methods: for any correlation value $\rho \in \{0, 1/20, 2/20, \ldots, 1\}$, we have considered $N = 1000$ samples, with size $n = 50$, of centered bivariate Gaussian couples with marginal variances equal to 1 and covariance $\rho$ varying from 0 to 1. Recall that the noncorrelation test of Pearson, for this particular Gaussian model, is the uniformly most powerful test among all tests with the same level $\alpha$. Figure 3 presents the power curves for the KL-MI (plain black curve) and $\chi^2$-MI (dotted black curve) independence tests, and for the Pearson (dashed red curve), Kendall and Spearman (mixed dashed and dotted red and blue curves) correlation tests, obtained from $N = 1000$ samples of size $n = 50$ of bivariate Gaussian distributions. For this setting, we can see from Figure 3 that our proposed KL-MI independence test is almost as powerful as the uniformly most powerful test of Pearson. The $\chi^2$-MI, Spearman and Kendall tests have comparable powers, lower than those of the KL-MI and Pearson tests.
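The computation of the KL-MI critical value described above (simulating $Z^\top Z$ with $Z$ centered normal and taking a linearly interpolated empirical quantile) can be sketched as follows. The covariance matrix $C$ of Theorem 3.3 is assumed to be available as a nested list; computing $\Sigma_1$ and $\Sigma_2$ for a given model is not shown, and the function names are hypothetical.

```python
import math
import random


def cholesky(C):
    """Lower-triangular factor L with L L^T = C, tolerating zero pivots
    (positive semidefinite C, e.g. with a null first row and column)."""
    d = len(C)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(C[i][i] - s, 0.0))
            elif L[j][j] > 0.0:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L


def quadratic_form_quantile(C, level=0.95, n_draws=10000, seed=0):
    """Monte-Carlo quantile of Z^T Z for Z ~ N(0, C), with linear interpolation."""
    rng = random.Random(seed)
    L = cholesky(C)
    d = len(C)
    draws = []
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(d)]
        draws.append(sum(v * v for v in z))
    draws.sort()
    # Linear interpolation between the two neighbouring order statistics.
    pos = level * (n_draws - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, n_draws - 1)
    return draws[lo] + (pos - lo) * (draws[hi] - draws[lo])
```

When $C$ is the identity matrix of size $d$, the returned value approximates the corresponding quantile of a $\chi^2(d)$ distribution, which gives a quick sanity check of the simulation.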



Figure 3. Comparison of powers of KL-MI and $\chi^2$-MI tests with noncorrelation tests of Pearson, Spearman and Kendall.
5.3. Comparison of $\varphi$-MI based tests for a copula density model. This section aims at comparing numerically the $\varphi$-MI based independence tests in the context of a semiparametric copula-type model, as described in Example 2.5. We consider here the Farlie-Gumbel-Morgenstern (FGM) copula model

$$C_{FGM}(u, v; \theta) = uv\big(1 + \theta(1-u)(1-v)\big), \quad (u,v) \in [0,1]^2,\ \theta \in \Theta = [-1, 1],$$

with $\theta_0 = 0$. We compare the powers of KL-MI and $\chi^2$-MI based tests of independence to those of noncorrelation tests. We consider the alternative hypothesis that $X$ and $Y$ are uniformly distributed on $[0,1]$ and coupled by an FGM copula. We consider values of the parameter $\theta$ of the form $\theta = k/16$, with $k \in \{0,\ldots,16\}$. We have estimated the critical values of the KL-MI and $\chi^2$-MI tests using the bootstrap procedure presented in Section 3.3, from an original sample of size $n = 50$ resampled 10 000 times. The powers are computed by the Monte-Carlo method from $N = 5000$ samples of size $n = 50$. The results are presented in Table 3. We can see again that the KL-MI based test outperforms the others. We can also see that the empirical levels of the KL-MI and $\chi^2$-MI tests (their rejection rates at $\theta = 0$), calibrated through the bootstrap procedure described in Section 3.3, are sufficiently close to the test level $\alpha = 0.05$.
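Samples from the FGM alternative can be generated by conditional inversion. The sketch below is an assumption-labeled illustration (the function name `sample_fgm` is hypothetical and this is not necessarily how the simulations above were run). Given $U = u$, the conditional distribution function of $V$ is $\partial C_{FGM}/\partial u = v\,(1 + a(1-v))$ with $a = \theta(1-2u)$; solving $v\,(1 + a(1-v)) = w$ for a uniform draw $w$ gives the root used in the code, written in a form that remains valid when $a = 0$.

```python
import math
import random


def sample_fgm(theta, n, seed=0):
    """Draw n pairs (U, V) with uniform margins and FGM copula of parameter theta."""
    if not -1.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.random()
        w = rng.random()
        a = theta * (1.0 - 2.0 * u)
        b = 1.0 + a
        # Root of a*v^2 - b*v + w = 0 lying in [0, 1], in a numerically stable form.
        disc = max(0.0, b * b - 4.0 * a * w)
        denom = b + math.sqrt(disc)
        v = 2.0 * w / denom if denom > 0.0 else 0.0
        pairs.append((u, min(1.0, v)))
    return pairs
```

For $\theta = 0$ the second coordinate reduces to the independent draw $w$, which corresponds to the null hypothesis. Under this model the correlation between $U$ and $V$ equals $\theta/3$, so even $\theta = 1$ is a fairly weak dependence, in line with the moderate powers reported in Table 3.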


$\theta$    0      1/16   2/16   3/16   4/16   5/16   6/16   7/16
KL-MI       0.062  0.061  0.064  0.076  0.093  0.120  0.142  0.171
$\chi^2$    0.054  0.055  0.057  0.066  0.084  0.108  0.129  0.160
Pearson     0.052  0.057  0.061  0.072  0.089  0.113  0.135  0.170
Spearman    0.055  0.058  0.060  0.069  0.086  0.110  0.133  0.164
Kendall     0.056  0.057  0.057  0.069  0.086  0.111  0.130  0.161

$\theta$    8/16   9/16   10/16  11/16  12/16  13/16  14/16  15/16  1
KL-MI       0.219  0.261  0.312  0.382  0.431  0.498  0.565  0.622  0.691
$\chi^2$    0.202  0.244  0.296  0.362  0.404  0.472  0.527  0.589  0.659
Pearson     0.213  0.257  0.309  0.375  0.427  0.493  0.549  0.611  0.677
Spearman    0.207  0.249  0.300  0.369  0.410  0.478  0.533  0.596  0.663
Kendall     0.203  0.243  0.293  0.356  0.405  0.467  0.527  0.584  0.647

Table 3. Power functions of KL-MI and $\chi^2$-MI tests compared to noncorrelation tests, obtained from N = 5000 samples of size n = 50 of the FGM copula with parameter $\theta$ varying from 0 to 1 by steps of 1/16.


6. Concluding remarks and discussion 

In this paper, we have defined and studied estimates of $\varphi$-mutual informations, based on the dual representation of $\varphi$-divergences and a semiparametric modeling of the density ratio between the joint distribution of the couple and the product distribution of its margins. The consistency of these estimates, named dual-estimates, has been established assuming some classical regularity conditions on the model; the asymptotic normality has been established for the classical Kullback-Leibler mutual information and specific models by means of classical M-estimation theory arguments. The asymptotic normality of other $\varphi$-mutual information dual-estimates may be derived similarly, for specific models depending on the considered $\varphi$-divergence. For example, when dealing with the power divergence associated to the functions $\varphi_\gamma$ given by (6), the asymptotic normality of the corresponding $\varphi_\gamma$-mutual-information dual-estimates may be derived in a similar way when focusing on the so-called $\gamma$-exponential semiparametric model

$$P \in \Big\{ P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \text{ such that } \frac{dP}{dP^\perp}(x,y) = \exp_\gamma(\ldots) \Big\},$$

where $\exp_\gamma(t) := \big((\gamma - 1)t + 1\big)_+^{1/(\gamma-1)}$, with $(\cdot)_+ = \max(0, \cdot)$. Our semiparametric approach for estimating mutual informations constitutes a promising alternative to classical nonparametric procedures based on kernel density estimation or adaptive partitioning. No parameters such as bandwidth or kernel type have to be adjusted. The asymptotic normality of dual-estimates is also of significant importance, particularly for hypothesis-testing purposes. For the sake of both completeness and accessibility, we are developing a package for the R software providing user-ready procedures, including the k-fold cross validation procedure described in Section 2.4, for selecting the model that best matches the data. We also aim at comparing the dual-estimates of mutual informations to nonparametric estimates. As an application of dual-estimation of mutual informations, we have derived a class of independence tests, recovering as a particular case the classical $\chi^2$-independence test. For a large variety of situations including finite-discrete random couples, the most efficient test is based on the KL-MI estimates, outperforming the classical $\chi^2$-independence one. Motivated by the simulation experiments presented in this paper, we conjecture that the optimality of the KL-MI independence test can be extended to a larger family of models.
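As a small illustration of the $\gamma$-exponential function mentioned above, the following sketch evaluates $\exp_\gamma(t) = ((\gamma-1)t + 1)_+^{1/(\gamma-1)}$, treating $\gamma = 1$ as the ordinary exponential (its limit as $\gamma \to 1$). The function name `exp_gamma` and the handling of the boundary cases are choices made for this example.

```python
import math


def exp_gamma(t, gamma):
    """gamma-exponential: ((gamma - 1) * t + 1)_+ ** (1 / (gamma - 1))."""
    if abs(gamma - 1.0) < 1e-12:
        return math.exp(t)  # limiting case gamma -> 1
    base = max(0.0, (gamma - 1.0) * t + 1.0)
    exponent = 1.0 / (gamma - 1.0)
    if base == 0.0:
        # The positive part vanishes: 0 for gamma > 1, +infinity for gamma < 1.
        return 0.0 if exponent > 0 else math.inf
    return base ** exponent
```

For $\gamma = 2$ the function is simply $(t + 1)_+$, so a model built on it has a density ratio that is piecewise linear in its argument, in contrast with the log-linear form (28) used for the KL case.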


7. Appendix 


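Proof of Proposition 3.1. Using continuity of $g_\theta(x,y)$ in $\theta$ on the compact set $\Theta$, and condition (A.7), we can state, by the Bienaymé-Tchebychev inequality, the uniform convergence in probability

$$A_n := \sup_{\theta \in \Theta} \Big| \int g_\theta(x,y)\,d\widehat{P}^\perp(x,y) - \int g_\theta(x,y)\,dP^\perp(x,y) \Big| \to 0. \tag{35}$$

Under condition (A.6), using continuity of $f_\theta(x,y)$ in $\theta$ over the compact set $\Theta$, we have by the uniform weak law of large numbers the convergence in probability

$$B_n := \sup_{\theta \in \Theta} \Big| \int f_\theta(x,y)\,d\widehat{P}(x,y) - \int f_\theta(x,y)\,dP(x,y) \Big| \to 0. \tag{36}$$

Now, we have

$$\big|\widehat{I}_\varphi - I_\varphi(P)\big| = \Big| \sup_{\theta \in \Theta} M_n(\theta) - \sup_{\theta \in \Theta} M(\theta) \Big| =: |C_n| \tag{37}$$

with

$$C_{n,L} := M_n(\theta_T) - M(\theta_T) \le C_n \le M_n(\widehat{\theta}_\varphi) - M(\widehat{\theta}_\varphi) =: C_{n,R}.$$

We can see that both sides converge in probability to zero, since

$$|C_{n,L}| \le A_n + B_n \quad\text{and}\quad |C_{n,R}| \le A_n + B_n$$

and by the convergences (35) and (36). We conclude that $\widehat{I}_\varphi \to I_\varphi(P)$ in probability. The convergence of $\widehat{\theta}_\varphi$ to $\theta_T$ holds by direct application of Theorem 5.7 in van der Vaart (1998), using the uniform convergence in probability

$$\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \to 0$$

and the well-separability of the supremum $\theta_T$; it is unique and an interior point of $\Theta$.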


□ 


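Proof of Theorem 3.3. (a) Direct calculus gives

$$P f'_0 - P^\perp g'_0 = 0 \tag{38}$$

and

$$P f''_0 - P^\perp g''_0 = -\Sigma_1. \tag{39}$$

Observe that the above matrix $\Sigma_1$ is symmetric and positive.

For any $\theta \in \Theta$, we have $M'_n(\theta) = \widehat{P} f'_\theta - \widehat{P}^\perp g'_\theta$. Note that

$$f'_0(x,y) = g'_0(x,y) = \big(1, \xi_1(x)\zeta_1(y), \ldots, \xi_d(x)\zeta_d(y)\big)^\top.$$

We will state the asymptotic normality of $\sqrt{n}\,M'_n(0)$ using the multivariate Delta method. So consider the random column vector in $\mathbb{R}^{1+3d}$

$$U(x,y) := \big(1, \xi_1(x), \ldots, \xi_d(x), \zeta_1(y), \ldots, \zeta_d(y), \xi_1(x)\zeta_1(y), \ldots, \xi_d(x)\zeta_d(y)\big)^\top.$$

Denote by

$$\mu := E(U(X,Y)) = \big(1, P_1\xi_1, \ldots, P_1\xi_d, P_2\zeta_1, \ldots, P_2\zeta_d, P_1\xi_1 P_2\zeta_1, \ldots, P_1\xi_d P_2\zeta_d\big)^\top$$

which is a column vector in $\mathbb{R}^{1+3d}$. Then we have, by the multivariate central limit theorem, the convergence in distribution

$$\sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^{n} U(X_i, Y_i) - \mu\Big) \to \mathcal{N}_{1+3d}(0, \Sigma),$$

with $\Sigma = E\big((U(X,Y) - \mu)(U(X,Y) - \mu)^\top\big)$, from which we obtain, by the multivariate Delta method,

$$\sqrt{n}\,\big(M'_n(0) - \psi(\mu)\big) \to \mathcal{N}_{1+d}\big(0, \Sigma_2 := \psi'(\mu)\,\Sigma\,\psi'(\mu)^\top\big), \tag{40}$$

where $\psi$ is the function defined on $\mathbb{R}^{1+3d}$ into $\mathbb{R}^{1+d}$ by

$$\psi(x_0, x_1, \ldots, x_d, y_1, \ldots, y_d, z_1, \ldots, z_d) = (0, x_1 y_1 - z_1, \ldots, x_d y_d - z_d)^\top$$

which is of class $C^1$. Note that $\psi(\mu) = 0$, that the first component of $M'_n(0)$ is equal to zero for all $n$, and that the first column and row of the limiting covariance matrix $\Sigma_2$ are both equal to 0. Whence we have the convergence in distribution

$$\sqrt{n}\,M'_n(0) \to \mathcal{N}_{1+d}(0, \Sigma_2). \tag{41}$$

By Taylor expansion of $M'_n(\widehat{\theta}_{\varphi_1})$ in $\widehat{\theta}_{\varphi_1}$ around $\theta_T = 0$, using condition (A.8) and the convergence in probability of $\widehat{\theta}_{\varphi_1}$ to $\theta_T = 0$, we obtain

$$0 = M'_n(\widehat{\theta}_{\varphi_1}) = M'_n(0) + M''_n(0)\,\widehat{\theta}_{\varphi_1} + o_p(1)\,\widehat{\theta}_{\varphi_1}. \tag{42}$$

On the other hand, by (A.9), we can write

$$M''_n(0) = P f''_0 - P^\perp g''_0 + o_p(1) = -\Sigma_1 + o_p(1).$$

Combining the last two displays leads to

$$M'_n(0) = (\Sigma_1 + o_p(1))\,\widehat{\theta}_{\varphi_1}. \tag{43}$$

We have, from (41), that $\sqrt{n}\,M'_n(0) = O_p(1)$, which by (43) implies that $\sqrt{n}\,\widehat{\theta}_{\varphi_1} = O_p(1)$. Combining this last result with the relation (42), we obtain

$$\sqrt{n}\,\widehat{\theta}_{\varphi_1} = \Sigma_1^{-1}\sqrt{n}\,M'_n(0) + o_p(1). \tag{44}$$

Use this last relation and (41) to conclude the proof of part (a).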


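(b) By Taylor expansion of $\widehat{I}_{\varphi_1} = M_n(\widehat{\theta}_{\varphi_1})$ in $\widehat{\theta}_{\varphi_1}$ around $\theta_T = 0$, using the fact that $M_n(0) = 0$ and some of the above statements, we obtain

$$2n\,\widehat{I}_{\varphi_1} = \big(\sqrt{n}\,\widehat{\theta}_{\varphi_1}\big)^\top \Sigma_1 \big(\sqrt{n}\,\widehat{\theta}_{\varphi_1}\big) + o_p(1),$$

which by (44) leads to

$$2n\,\widehat{I}_{\varphi_1} = \big(\sqrt{n}\,M'_n(0)\big)^\top \Sigma_1^{-1} \sqrt{n}\,M'_n(0) + o_p(1) \tag{45}$$

$$= \big(\sqrt{n}\,\Sigma_1^{-1/2} M'_n(0)\big)^\top \Sigma_1^{-1/2} \sqrt{n}\,M'_n(0) + o_p(1). \tag{46}$$

This proves the convergence in distribution of $2n\,\widehat{I}_{\varphi_1}$ to the random variable $Z^\top Z$, where $Z$ is a centered multivariate normal random variable with covariance matrix $C = \Sigma_1^{-1/2}\,\Sigma_2\,\Sigma_1^{-1/2}$. □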


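Proof of Theorem 4.1. First, under assumption (A.10), Eichelsbacher and Schmock (2002) yields the following large deviations principle for the sequence $(\widehat{P})_n$ of empirical measures: we have, for every measurable subset $B$ of $\mathcal{M}_\mathcal{G}$,

$$\liminf_{n \to \infty} \frac{1}{n} \log P^\perp\big(\widehat{P} \in B\big) \ge -\inf_{Q \in \mathrm{int}_{\tau_\mathcal{G}}(B)} K(Q, P^\perp), \tag{47}$$

and

$$\limsup_{n \to \infty} \frac{1}{n} \log P^\perp\big(\widehat{P} \in B\big) \le -\inf_{Q \in \mathrm{cl}_{\tau_\mathcal{G}}(B)} K(Q, P^\perp), \tag{48}$$

where $\mathrm{int}_{\tau_\mathcal{G}}(B)$ and $\mathrm{cl}_{\tau_\mathcal{G}}(B)$ denote, respectively, the interior and closure of $B$ with respect to the $\tau_\mathcal{G}$-topology. Since $Q \in \mathcal{M}_\mathcal{G} \mapsto \mathcal{D}_\varphi(Q, Q^\perp)$ is continuous, we obtain by the contraction principle from (47) and (48), for all $d > 0$,

$$\liminf_{n \to \infty} \frac{1}{n} \log P^\perp\big(\widehat{I}_\varphi > d\big) \ge -\inf\big\{ K(Q, P^\perp);\ Q \in \mathcal{M}_\mathcal{G} \text{ and } \mathcal{D}_\varphi(Q, Q^\perp) > d \big\} \tag{49}$$

and

$$\limsup_{n \to \infty} \frac{1}{n} \log P^\perp\big(\widehat{I}_\varphi \ge d\big) \le -\inf\big\{ K(Q, P^\perp);\ Q \in \mathcal{M}_\mathcal{G} \text{ and } \mathcal{D}_\varphi(Q, Q^\perp) \ge d \big\}. \tag{50}$$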


We now prove that the function $e_\varphi(\cdot) : d \in \mathbb{R}_+ \mapsto \inf\{K(Q, P^\perp);\ Q \in \mathcal{M}_\mathcal{G} \text{ and } \mathcal{D}_\varphi(Q, Q^\perp) \ge d\} \in [0, +\infty]$ is right-continuous, so that the infima in (49) and (50) are equal, yielding (30). So, let $d > 0$ be any positive real number, and let us show that $e_\varphi(\cdot)$ is right-continuous at $d$. If no $Q \in \Omega_d$ exists such that $K(Q, P^\perp) < +\infty$, then, since for any $d'$ such that $d < d'$ we have $\Omega_{d'} \subset \Omega_d$, both $e_\varphi(d)$ and $e_\varphi(d')$ equal $\infty$, which implies that $e_\varphi$ is right-continuous at $d$ in this case. Now, assume that some $Q \in \Omega_d$ exists such that $K(Q, P^\perp) < \infty$. Two cases can be handled separately. First, assume that the infimum (31) is achieved for some $Q$ such that $\mathcal{D}_\varphi(Q, Q^\perp) =: d' > d$. Then, for all $d''$ satisfying $d < d'' < d'$, the equality $e_\varphi(d'') = e_\varphi(d)$ holds, yielding the right-continuity of $e_\varphi$ at $d$. Second, assume that the infimum (31) is achieved for $Q$ such that $\mathcal{D}_\varphi(Q, Q^\perp) = d$. Let us prove that there exists a sequence $(Q_n)_n$ of elements of $\{Q : \mathcal{D}_\varphi(Q, Q^\perp) > d\}$ such that $K(Q_n, P^\perp) \to K(Q, P^\perp)$, yielding right-continuity of $e_\varphi(\cdot)$ at $d$. We build such a sequence $(Q_n)_n$ such that $Q_n$ has the same marginal distributions as $Q$, i.e., $Q_{n,1} = Q_1$ and $Q_{n,2} = Q_2$. We have then $Q_n^\perp = Q^\perp$. Let

$$\overline{\theta} := \arg\sup_{\theta \in \Theta} \Big\{ \int \varphi'(h_\theta)\,dQ - \int \varphi^*(\varphi'(h_\theta))\,dQ^\perp \Big\},$$

so that

$$\int \varphi'(h_{\overline{\theta}})\,dQ - \int \varphi^*(\varphi'(h_{\overline{\theta}}))\,dQ^\perp = \mathcal{D}_\varphi(Q, Q^\perp) = d > 0. \tag{51}$$

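Denote by $\widetilde{Q}$ the image distribution on the Borel $\sigma$-field $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ of $Q$ by the function $\varphi'(h_{\overline{\theta}})$. Let us prove by contradiction that $\widetilde{Q}$ cannot be a Dirac measure, by making use of either (A.12.a) or (A.12.b). If $\widetilde{Q}$ were a Dirac measure, necessarily $\varphi'(h_{\overline{\theta}})$ would be $Q$-a.s. constant, i.e., $h_{\overline{\theta}}$ would be $Q$-a.s. constant

$$h_{\overline{\theta}}(\cdot,\cdot) = c \quad Q\text{-a.s.} \tag{52}$$

Now, if (A.12.a) holds, we can consider the set of all finite-discrete distributions with the same finite support $\mathcal{X} \times \mathcal{Y}$, instead of the set $\mathcal{M}_\mathcal{G}$. Hence, $Q$ and $Q^\perp$ have the same support, so that (52) implies that

$$h_{\overline{\theta}}(\cdot,\cdot) = c \quad Q^\perp\text{-a.s.} \tag{53}$$

Combining (52), (53) and (51), we obtain

$$\varphi(c) + \varphi'(c)(1 - c) = d > 0. \tag{54}$$

On the other hand, by convexity of $\varphi$ and the fact that $\varphi(1) = 0$, we get

$$0 = \varphi(1) \ge \varphi(c) + \varphi'(c)(1 - c) = d,$$

which contradicts the fact that $d > 0$. Alternatively, assume that (A.12.b) holds. Note that, under this assumption in connection with (A.2), the value $\theta_0$ (of the parameter corresponding to independence) is necessarily $\theta_0 := (\alpha_0, \beta_0^\top)^\top = (0, 0, \ldots, 0)^\top$. We can see also, by contradiction as above, that $\overline{\theta}$ cannot be of the form $(\alpha, 0, \ldots, 0)^\top$ with $\alpha \neq 0$. Hence, it can be written as

$$\overline{\theta} = (\overline{\alpha}, \overline{\beta}^\top)^\top \quad\text{with}\quad \overline{\beta} \neq (0, \ldots, 0)^\top. \tag{55}$$

Now, by (52), using the fact that $h_{\overline{\theta}}(\cdot,\cdot)$ is of the form $\exp(\overline{\alpha} + m_{\overline{\beta}}(\cdot,\cdot))$, we get that

$$m_{\overline{\beta}}(\cdot,\cdot) = \text{cte}, \quad Q\text{-a.s.} \tag{56}$$

Note that the support of $Q$ is included in that of $P^\perp$ (if not, $Q$ would not be absolutely continuous with respect to $P^\perp$ and $K(Q, P^\perp)$ would not be finite). Hence, (56) implies that $P^\perp(m_{\overline{\beta}}(X, Y) = \text{cte}) \neq 0$, which in turn implies that $\overline{\beta} = (0, \ldots, 0)^\top$ by assumption (A.12.b). This contradicts (55). We have thus proven that $\widetilde{Q}$ is not a Dirac measure. So, there exist two measurable subsets $A$, $B$ of $(a_{\varphi^*}, b_{\varphi^*}) = \mathrm{Im}(\varphi')$ such that $\widetilde{Q}(A) > 0$, $\widetilde{Q}(B) > 0$ and $a := \inf A > b := \sup(B)$. Denoting by $g := dQ/dQ^\perp$ the density of $Q$ with respect to the product of its marginal distributions, set

$$g_n := \Big(1 + \frac{c_1}{n}\Big)\, g\, \mathbb{1}_{\{\varphi'(h_{\overline{\theta}}) \in A\}} + \Big(1 - \frac{c_2}{n}\Big)\, g\, \mathbb{1}_{\{\varphi'(h_{\overline{\theta}}) \in B\}} + g\, \mathbb{1}_{\{\varphi'(h_{\overline{\theta}}) \notin A \cup B\}}$$

$$= g + \frac{c_1}{n}\, g\, \mathbb{1}_{\{\varphi'(h_{\overline{\theta}}) \in A\}} - \frac{c_2}{n}\, g\, \mathbb{1}_{\{\varphi'(h_{\overline{\theta}}) \in B\}},$$

where $c_1 := \widetilde{Q}(B)$ and $c_2 := \widetilde{Q}(A)$. Note that $g_n$ is nonnegative for $n$ sufficiently large, and that $\int g_n(x,y)\,dQ^\perp(x,y) = 1$. Then, let $Q_n$ be the probability distribution on $\mathcal{X} \times \mathcal{Y}$ such that $Q_{n,1} = Q_1$, $Q_{n,2} = Q_2$ and $dQ_n/dQ^\perp = g_n$. We have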


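$$\int_{\mathcal{X}\times\mathcal{Y}} \varphi'(h_{\overline{\theta}})\,dQ_n = \int_{\mathcal{X}\times\mathcal{Y}} \varphi'(h_{\overline{\theta}})\,g_n\,dQ^\perp$$

$$= \int \varphi'(h_{\overline{\theta}})\,dQ + \frac{c_1}{n}\,E_{\widetilde{Q}}(\mathrm{Id}\cdot\mathbb{1}_A) - \frac{c_2}{n}\,E_{\widetilde{Q}}(\mathrm{Id}\cdot\mathbb{1}_B)$$

$$\ge \int \varphi'(h_{\overline{\theta}})\,dQ + \frac{c_1}{n}\,a\,\widetilde{Q}(A) - \frac{c_2}{n}\,b\,\widetilde{Q}(B)$$

$$= \int \varphi'(h_{\overline{\theta}})\,dQ + \frac{\widetilde{Q}(A)\,\widetilde{Q}(B)}{n}\,(a - b)$$

$$> \int \varphi'(h_{\overline{\theta}})\,dQ,$$

where $\mathrm{Id}(x) := x$, for all $x \in (a_{\varphi^*}, b_{\varphi^*})$. Then,

$$\mathcal{D}_\varphi(Q_n, Q_n^\perp) \ge \int \varphi'(h_{\overline{\theta}})\,dQ_n - \int \varphi^*(\varphi'(h_{\overline{\theta}}))\,dQ^\perp$$

$$> \int \varphi'(h_{\overline{\theta}})\,dQ - \int \varphi^*(\varphi'(h_{\overline{\theta}}))\,dQ^\perp$$

$$= d.$$

Finally, the convergence of $K(Q_n, P^\perp)$ to $K(Q, P^\perp)$ can be proved using the decompositions

$$K(Q_n, P^\perp) = K(Q_n, Q^\perp) + K(Q^\perp, P^\perp), \quad K(Q, P^\perp) = K(Q, Q^\perp) + K(Q^\perp, P^\perp),$$

and Lebesgue’s dominated convergence theorem. □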